Luca Citi,
Riccardo Poli, and
Caterina Cinel
BrainComputer Interfaces Lab,
School of Computer Science and Electronic Engineering, University of Essex, UK
BrainComputer Interfaces (BCIs) measure specific (intentionally and unintentionally induced) brain activity signals and translate them into device control signals [1,2].
Many factors limit the performance of BCIs. These include: the natural variability and noise in the brain signals measured; the limitations of recording devices, of the processing methods that extract signal features and of the algorithms that translate these features into device commands; the quality of the feedback provided to the user; the lack of motivation, the tiredness, the limited degree of understanding, the age variations, the handedness, etc. of users; the combination of mental tasks used; and the natural limitations of the human perceptual system [3].
While amplitude and shape variations in brain waves are often a hindrance for BCIs, in some cases they carry information which can be exploited to improve performance, irrespective of whether a physiological explanation for such variations is available [3,4]. Some of these variations are hard to document because standard averages may severely misrepresent what really goes on in the brain on a trial by trial basis [5,6,7,8], although more sensitive averaging techniques have recently become available [9,10].
In this paper, we will document, theoretically model and exploit a source of waveform variations in P300 waves, namely, their modulation caused by variations in the interval between targetstimulus presentations. We illustrate our ideas using the Donchin speller, although the benefits of the approach can be accrued also in other P300based systems.
The paper is organised as follows. In section 2 we provide some background on P300 waves and on the factors that may affect their characteristics, with particular attention paid to effects associated with the timing of stimulus presentation. In the same section we also introduce the Donchin speller paradigm. In section 3, we document the effects of the number of nontargets preceding a target on the amplitude and shape of P300s. In section 4, we consider a typical twostage implementation of Donchin's speller, where one stage processes the responses to single stimuli while the other integrates such responses and decides which character to enter. We then derive mathematical models of the two stages, which, together, express the speller's accuracy, and generalise such models to make them applicable to a variety of ERPbased BCIs. Section 5 extends the model of the first stage of Donchin speller so as to consider the modulations of P300s associated with stimulus history documented in section 3. In section 6, we propose and model a new algorithm for the second stage of the speller that exploits such effect making it possible to compare its performance to that of the standard speller. The analysis (corroborated by Monte Carlo simulations) indicates that our algorithm is superior to the top performing Donchin speller from the literature. In section 7 we verify the predictions of our mathematical models by using two independent datasets: one from BCI Competition III and 12 subjects tested in our labs. Results show that the new system is statistically significantly superior to the original. We discuss and position our results in section 8. Finally, we present some conclusions and indicate promising avenues for future work in section 9.
Among the different approaches used in BCI studies, those based on event related potentials (ERPs) [11,12,13] are rather appealing since, at least in principle, they offer a relatively high bitrate and require no user training. ERPs are relatively well defined shapewise variations of the ongoing EEG which are elicited by a stimulus and are temporally linked to it. ERPs include an exogenous response, due to the primary processing of the stimulus, as well as an endogenous response, which is a reflection of higher cognitive processing induced by the stimulus [14].
The P300 wave is an endogenous ERP with a latency range of approximately 300 to 600 ms and which is elicited by rare and/or significant stimuli (visual, auditory, or somatosensory). Effectively, the presence of P300 ERPs depends on whether or not a user attends a rare, deviant or target stimulus. This is what makes it possible to use them, in BCI systems, to determine user intent. For an overview of the cognitive theory and the neurophysiological origin of the P300 see [15].
The characteristics of the P300, particularly its amplitude and latency, depend on several factors [16]. Some factors, such as food intake, fatigue or use of drugs, are related to the psychophysical state of the subject [17]. Other factors, such as the number of stimuli, their size and their relative position, are related to the spacial layout of the stimuli [18,19,20]. Finally, there are several important factors related to the timing and sequence of stimuli. Since these are the focus of this paper, we will devote the rest of this section to summarise what is known about such factors.
Several studies have reported that P300 amplitude increases as target probability decreases (see [14] for a review). The P300 amplitude seems to also be positively correlated with the interstimulus interval (ISI) or the stimulus onset asynchrony (SOA) (e.g., see [21,22,23]). Other studies [24,25] note that, despite the P300 being clearly affected by targetstimulus probability and ISI, each of these factors also varies the average targettotarget interval (TTI) and hypothesise that TTI is the true factor underlying the P300 amplitude effects attributed to target probability, sequence length, and ISI. The P300 is also sensitive to the order of nontarget and target stimuli, most likely because this temporarily modifies targetstimulus probabilities and TTIs. Indeed, there is a positive correlation between P300 amplitude and the number of nontarget stimuli preceding a target (e.g., [26,27]).
In order to avoid the overlap between two consecutive epochs, in psychophysiology the effects discussed above have always been studied with SOAs of approximately one second or more. However, such long SOAs would drastically limit the information transfer rate in BCIs, where SOAs of less than 300 ms are much more common. Also in these cases there is a reduction of the P300 amplitude when two targets are close in a sequence of stimuli. However, this is more likely the result of phenomena like repetition blindness or attentional blink [3] than the TTI effect documented with slower protocols.
The influence of the timing and sequence of stimuli preceding a target on the P300 could be partially explained as the result of ``recovery cycle'' limitations in the mechanisms responsible for the generation of this ERP [24]. The smaller potentials produced when a target is presented shortly after another target might simply be the result of the brain not having yet reacquired the necessary resources to produce large ERPs.
Some ascribe such effects to an inability to consistently generate P300s in the presence of short TTIs [28], suggesting that a smaller average amplitude might be the result of an increase in the fraction of responses to target stimuli that do not present a P300 rather than a modulation of the P300 amplitude in all responses.
In summary, the literature indicates that the order and timing of stimuli have measurable effects of P300s. In an effort to capture and exploit some such effects to improve BCI performance, in this paper we will focus on the number of nontargets preceding a target stimulus. Note that we make no claims regarding whether or not such a number is the most primitive factor affecting P300 amplitudes. All we need for our method to improve performance is that P300's amplitude be correlated with this number, which is the case for Donchin's speller, as we will show in Section 3.
Donchin and Farwell [11] designed a speller based on the P300 ERP, which is now in widespread use. In this speller, users are presented with a 6 by 6 matrix of characters (see figure 1), the rows and columns of which are highlighted for a short period and one at a time. The sequence of flashes is generated randomly without replacement, i.e., a row or column is not allowed to flash again until all other rows and columns have flashed. The users' task is to focus their attention on the character they want to input. Thus, the flashes of the row and column containing the desired character represent target stimuli. As they are relatively rare, they elicit a P300 wave. In principle, a BCI could then identify the row and column of the target character, thereby inferring it, by looking for the largest P300like responses in the EEG signals acquired immediately after the flashing of all the rows and columns, respectively. Unfortunately, noise makes reliably recognising P300s in single trials very difficult. However, by asking users to focus their attention on the character of interest for multiple blocks of 12 flashes, and averaging either the signal epochs related to each row and column or the corresponding outputs of a P300 classifier, it is possible to infer the target character with acceptable accuracy [29].
Evidence of the shape of P300s being modulated by the temporal distance between consecutive targets in the Donchin speller has recently been reported [30,31,32]. In this work, we will take this one step further by modelling and exploiting this phenomenon in order to improve the performance of the speller. Before we do so, however, in this section we want to present experimental evidence further corroborating the results of [30,31,32]. Related experimental data will be used in Section 5.1 to corroborate our model.
We studied the training set of the two subjects of dataset II from the BCI competition III [33] (results with 12 further subjects will be presented in Section 7.2). Data were collected with the Donchin speller protocol described above, using a SOA of 175 ms. For each subject, the training set consisted of 85 blocks. Each block corresponds to a subject focusing on one character for 15 repetitions of the sequence of the flashes of the 6 rows and 6 columns in the display. Further details on the data can be found in [33].
We bandpassfiltered the signals in the band 0.15  5 Hz (HPF: 1600tap FIR; LPF: 960tap FIR) to reduce exogenous components at the stimulus presentation frequency (5.7 Hz) and higher. Onesecond epochs following each flash were extracted. Therefore, for each subject, a total of 15300 (851512) epochs were available, of which 2550 (85152) corresponded to targets.
The set of epochs was partitioned according to the number, (for ``history''), of nontargets presented between the previous target and the stimulus corresponding to the current epoch. For example, for the sequence ...TNNNT..., the second target (T) is assigned to partition t3 because it is preceded by three nontargets (N).
Outlier, e.g., epochs with unusually large deflections due to movements or eyeblinks, were removed using the following procedure. For each set of epochs, the first quartile, , and third quartile, , at each sample, , were found. Then an acceptance ``strip'' was defined as the timevarying interval where is the interquartile range. Responses falling outside the acceptance strip for more than one tenth of the epoch were rejected. For each partition, the mean, , and the standard error, , of the responses for each class were finally evaluated using the remaining epochs.
Figure 2 shows the average responses obtained from the epochpartitioning and artefactrejection procedures described above. The results suggest that despite Donchin's speller being characterised by a fast SOA, significant modulations of the P300 amplitude related to the number, , of nontargets preceding a stimulus are present. The most significant effect is visible around the peak of the ``t'' line, that represents the average P300. Indeed, if we focus on around 450 ms for subject A and 350 ms for B, we see that the P300 is hardly visible in the average associated with the ``t0'' partition which represents two targets in a row. The same is true for the average of partition ``t1''. The averages for other partitions show a progressive increase in the P300 amplitude as the number of nontargets separating a target from the previous target grows.
The effect is particularly clear in figure 3 where we plotted P300 amplitudes vs. for subjects A and B at 450 ms and 350 ms, respectively. Note that, because of the randomisation within each block of 12 stimulus presentations, it is possible, albeit infrequent, for a target stimulus in a block to be followed by up to 20 nontarget stimuli before a target stimulus in the following block is presented.
Subject A Subject B 
An objective of this paper is to model and exploit the P300 modulations documented in the previous section to improve the accuracy of Donchin's speller. As we will see in Section 6, simple modifications to the internal structure of the speller can achieve this. However, before we can proceed with this, we need to describe and model the internal operations of typical Donchin spellers. We do this in this section.
In a Donchin speller, inferring the character the user wants to input is a twostage process (figure 4). The ERPs produced in the 1second epochs following the flashing of the rows and columns of characters in the display are acquired across all EEG channels and time steps, resulting in a sequence of 2D arrays (far left of the figure). In the first stage, each such array is processed by a feature extractor/classifier whose output is expected to contain useful information as to whether the corresponding flash represented a target or a nontarget (flash score). In the second stage, an algorithm combines the flash scores for different epochs, assigns a score to each character (character score) and chooses the character with the highest score as its classification for the data. Below we provide details on how the two stages are implemented in current ERPbased BCIs.
As shown in figure 4(left), the first stage often includes a feature extractor/selector that creates a lowpass filtered and subsampled version of the epoch following a flash and selects a subset of informative samples and channels. Let us call the resulting vector of features characterising the th epoch. Feature vectors are typically processed by a classifier that performs a rotation and a projection of the feature space in the direction that maximises some form of separability measure between the target and nontarget classes. Frequently the classifier is used in an unconventional way: it is trained as if it were a binary classifier with a twoclass output, but it is then used without applying any thresholding as a continuous scoring function, . Let be the flash score produced by such a function in response to feature vector .
As shown in figure 4(right), in its simplest form the second stage can be performed in two steps. The first involves averaging the scores associated with the flashing of each row and column independently and then computing a score for each character by adding up the averages of the corresponding row and column. The second step simply requires choosing the character with the maximum character score [34].
In formulae, in the standard approach, the character at the intersection of row and column is given a score
(2) 
From the description provided in the previous section, it is clear that the accuracy of Donchin's speller depends the performance of the two main stages shown in figure 4. In this section we will construct a probabilistic model that will allow us to compute the overall performance of the speller. We will do so by modelling its second stage in detail, while we will coarsegrain on the internal mechanisms of the first by only modelling its output, i.e., the flash scores. We do this because later in the paper we will modify the second stage so as to exploit target history effects and this will require only knowing and adapting the details of the model of the second stage of the standard speller.
Before we start, we need to model the sequence of stimuli, since the output of a speller and its performance are necessarily functions of the visual stimuli as well as, naturally, of the user's intentions. We model the sequence of flashes presented to the subject as stochastic vector . Each component of the vector is itself a stochastic variable drawn from the set of outcomes where represents the event ``the th row flashed'', while represents the event ``the th column flashed''. Let us denote with an instantiation of and with the instantiation of the element/variable .
The sequence of stimuli needs to be treated as a stochastic variable rather than a constant because of the randomness in the sequence generation process described in section 2.2. Thus, in different repetitions of the experiment different sequences of stimuli will likely be presented to the user. This will happen, for example, if the user needs to enter different letters or the same letter multiple times.
ERPs are modulated by the nature and type of the stimuli eliciting them. However, even in the presence of identical external stimuli, ERPs are typically affected by a great deal of noise and variability. As a result the feature vectors, , and the flash scores, , produced in the first stage of Donchin's speller should be treated as variable, or, more specifically, as stochastic functions of the variables introduced above.
In principle to fully model the first stage of the speller which corresponds to the process of assigning a score to each flash, one would need to identify the joint probability distribution of the variables , , .... as a function of . However, this is very hard to do. Therefore, here we will introduce some simplifying assumptions, which, as we will show later, still produce accurate models. Firstly, we will assume that all the scores that are associated with the flashing of a row or column containing the target character are independent and identically distributed, and that the same is true for the scores produced in response of nontarget flashes. Secondly, if we denote by any stochastic variable that represents the flash score in the presence of a nontarget and by any variable representing the flash score in the presence of a target, we further assume that such variables have different means, which we will denote with and , respectively. Thirdly, we assume that, other than for the mean, such variables have the same distribution, i.e., they can be represented as
Naturally, exactly which variables in a sequence are of type and which are of type depends on whether the th flash is that of a row/column containing the target character or not, respectively.
Based on the observations above, it is clear that the character scores, , computed in the second stage of the speller via (1), are averages of stochastic variables (the flash scores, ) and, thus, are stochastic themselves. Randomness in the is not only due to the 's but also to the fact that the sets of indices and in (1) are themselves stochastic. In fact, as one can immediately see, if we formalise their definition, i.e., and , they are functions of the stimulus sequence .
An important consequence of the 36 character scores being stochastic variables is that one cannot say which character has the maximum value: in principle in different instantiations of the 36 scores different characters could present the maximum score. If the target character is at position , the returned character will be correct if and only if
Let us start by rewriting (1) in terms of stochastic variables:
The random variable has a distribution that depends on whether or not . Specifically, if , the variables in (7) are all distributed like . Therefore, is the average of independent instances of and, so, it is a normal variable with and . Similarly, if , the variables in (7) are all distributed like . Thus, if we denote with any non target row, is a normal variable with and .
Using similar arguments, it is easy to show that the variables and are distributed like and , respectively.
Because of the independence of the variables and the geometry of the speller's grid, for any the set of averaged in and the set of the averaged in are disjoint. The same is true of the variables. In other words, the 6 variables and the 6 variables are all mutually independent. Therefore,
Let us start with . Having already computed the distribution for , first, we need to compute the distribution of . To do so, we exploit the fact that any number greater than the maximum of multiple variables is greater than any individual variable. Therefore, for independent variables, the cumulative distribution function of their maximum is the product of the cumulative distribution functions of the original distributions. That is
We are now in a position to compute . In general, given two independent random variables and , the probability of is
Using the same approach one can easily compute . By substituting that result and (11) into (10), we finally obtain the following expression for the accuracy of Donchin's speller:
While the system described and modelled above is the most typical incarnation of Donchin's speller, also other types of ERPbased speller have been considered in the literature. By using procedures similar to the one we followed to obtain (12), we analysed a variety of such systems and modelled their accuracy. The spellers we considered are listed in the first column of table 1. These include Donchin spellers with different matrix sizes and different features for the distributions of flash scores for targets and nontargets, a sequential speller, and a generic multichoice selection BCI.
Interestingly, we found that the accuracy of all such systems is represented by the following general equation:
In section 4.2.1 we introduced a simple model of the first stage of Donchin's speller, which, in conjunction with a model of the second stage, was then used to derive an expression for the accuracy of the speller. The focus of this paper is really the second stage of the speller. However, the accuracy of the second stage, and, hence, of the speller, can only be evaluated if the statistical properties of the scores produced by the first stage are fully specified. In this section we want to extend the model of the first stage developed above so as take the modulations of P300s associated with stimulus history documented in section 3 into account. In section 6 we will propose and model a new algorithm for the second stage of the speller which will exploit such effects.
So far we have modelled the flash scores produced by the first stage of the speller as a normally distributed stochastic variable whose mean depends exclusively on whether or not the flash is a target or a nontarget. As discussed above, however, strongly affects the shape and amplitude of P300s and, so, it is reasonable to expect that this will also affect flash scores for target stimuli.
To take this into account, we propose to modify (4) by creating a separate version of it for each value of to which significant P300amplitude changes can be ascribed. In other words we replace it with the following:
While we have no evidence that significantly affects also the shape and amplitude of the ERPs produced in the presence of nontargets in Donchin speller, in principle, this might happen with particular choices for the shape and timing of the stimuli or in other types of BCI. These situations could easily be modelled by generalising (3) as we did for the targets, e.g., by considering a model of the form
In appendix D we will look at possible causes of TTIrelated modulations of flash scores in the presence of nontargets and we will explain how the theory and methods developed in the paper would be affected by the use of (15). However, as we will show at the end of section 5.3, this level of sophistication is unnecessary for Donchin speller. So, we will continue to use the simpler (3) to model nontarget scores.
Based on the choices we made in the previous section, in order to model the first stage of the speller in the presence of related effects we need to identify 12 parameters: the variance, , of the noise variable, , in (3) and (14); the parameter in (3); and the ten parameters in (14). Before we can proceed with this task, however, we need to choose a particular implementation of the first stage of the speller.
For the purpose of demonstrating the ideas, we decided to use an existing highperformance approach. More specifically, we borrowed the first stage from the work of Rakotomamonjy and Guigue [29] which won the BCI Competition III for the Donchin speller. Namely, we used an approach they called ``Ensemble SVM without channel selection'' which is easy to implement and outperforms other alternatives when using only 5 repetitions of the 12 stimuli to classify a character. In this approach, an ensemble of classifiers is used. The datasets are split into several subsets and a linear support vector machine (SVM) classifier is trained on each of them. When classifying unseen test data, the outputs of all classifiers are averaged to give a flash score representing the ``targetness'' of the flash.
We analysed the distribution of the flash scores produced by this approach on the BCI Competition data used also in section 3. As in [29], we divided the 85 data blocks of each subject into 17 subsets of 5 blocks each. To obtain realistic generalisation statistics and avoid overfitting problems, we applied a leaveoneout approach estimating the output of the ensemble classifier on a subset of data by averaging the output of 16 classifiers which were not trained using that subset. By repeating this estimation process for each of the 17 subsets we obtained a reliable estimate of the output of the ensemble on the full dataset of each subject.
We partitioned the data into two classes based on whether the corresponding input epochs represented targets (t) or nontargets (nt). We represented the class of a flash as . Then, we divided each class into subclasses according to , for . We, therefore, obtained a total of partitions, each represented by a tuple .
We computed the 12 parameters of the model of the first stage described in section 5.1 as follows. First, we evaluated the mean and variance of the flash scores associated with each partition using the 5%truncated mean, , and 5%truncated variance, . These are robust measures of central tendency and spread which are less sensitive to outliers than their nontruncated counterparts. We then computed the average of the variances of the partitions
We performed this model identification process independently for the two subjects in the dataset II of the BCI Competition III. The resulting parameters are reported in table 2. Note how in both cases markedly increases as increases, as also confirmed by linear regression. For example, the difference in flash score between nontargets and targets for is twice as much as for . As the classification's accuracy strongly depends on the distance between the two classes, it is clear that epochs with small (i.e., one target presented shortly after the previous one) represent an unfavourable condition for the speller.
Subject  



A  
1.183  1.188  0.936  0.867  0.730  0.566  0.669  0.588  0.494  0.534  
1.846  
0.982  
B  
0.824  0.783  0.428  0.490  0.313  0.417  0.165  0.212  0.139  0.198  
1.719  
0.876 
Having defined the model for the first stage and determined its parameters, we need to assess whether this model fits the actual flash scores resulting from the datasets.
We used a KolmogorovSmirnov test to check whether we could reject the hypothesis that the flash scores in each partition could be the result of sampling from the corresponding distributions in (3) and (14) with the parameters in table 2. Results indicated that we cannot reject this hypothesis for 19 out of 20 partitions (we discuss the possible reasons for the exception, partition nt1, later in this section). This confirms the substantial correctness of our assumption that the variables in (3) and (14) are normally distributed and all have the same variance.
To visually evaluate the goodness of the fit between our models and the actual flash scores produced by the ensemble SVM, we show the QQplots [37] associated to each partition of each subject in figure 5. They confirm that model fits very well the data in the key part of the distributions, i.e., between the 5th and the 95th percentiles.

It is natural to ask why the KolmogorovSmirnov test indicated that the data in partition nt1 are unlikely to be the result of sampling from the stochastic variable (i.e., (3)). Obviously, with the standard confidence level of 5% which we adopted we should expect a statistical test to give incorrect answers on average once in 20 applications simply because of stochastic effects. However, this does not appear to be the main reason for this effect. The most likely cause is that the mean score for the epochs in partition nt1 is slightly smaller than for other nontarget partitions, as shown in figure 6 which plots the average score for targets and nontargets for different values. In D we provide a possible explanation for this anomaly of nt1.
Irrespective of the causes of the minor anomaly of partition nt1, it is clear from the plots in figure 6 that the assumption that nontarget score averages are not substantially influenced by is essentially correct for Donchin's speller. Thus, the adoption of a more sophisticated model (such as (15)), over the simpler (3), is unnecessary.
In this section we will propose a new classification algorithm for Donchin's speller that exploits the flashscore modulations modelled and documented in section 5. This uses a modified second processing stage (where the system combines the flash scores into character scores and makes a final decision) which, in turn, will require some modification to the model of the accuracy of the speller presented in section 4.2.2.
Let us start by introducing some useful notation. Let be a set of integers representing the indices of some events in and let be a function that returns the number of flashes between the th flash and the last event in preceding such a flash, i.e., . We want to apply this function to , which represents the set of indices of events where either the row or the column of the character at position in the charactergrid of Donchin's speller flashed.
Let us denote by and the row and column containing the target character, respectively. Then, if the th flash is a target, gives the value of for the th flash, i.e., the number of nontargets preceding the target flash . Note, however, that in general the coordinates of the target character are unknown to the speller. So, the function should not be interpreted as representing the true of the th flash, but as its value on the hypothesis that the target character is at position . To reiterate, only when scoring the target character , the function gives the true value of the th flash.
To account for the effects of , we propose to modify the scoring formula in (1) by weighing flash scores based on the values returned by the function . The scoring function we used is as follows:
In the version of Donchin's speller considered in this paper, the codomain of the function is the set of integers (see discussion at the end of section 3). So, the algorithm only evaluates the function on points in that set. Thus, to fully specify the behaviour of the algorithm in (16) we need to know the value of the parameter and the (codomain) values taken by the function for domain values in the set . In other words, our scoring technique has a total of 22 degrees of freedom which need to be fixed before the algorithm can be used. We will do this in section 6.3 by using the theoretical model of (16) developed in the next section.
In section 4.2 we modelled the accuracy of the standard Donchin speller disregarding the fact that flash scores are affected by . We now need to modify that model to consider both the influence of on targetflash scores (which we studied in section 5) and also the use of the weighted scoring function in (16).
The procedure we used to derive the new model follows approximately the same strategy as in section 4.2: we need to define and study two stochastic variables representing the weighted sum of row and column flash scores, respectively, and then work out with which probability the row and column containing the target will have a higher score than any other row and column.
Since the calculations involved are quite laborious, we provide the detailed treatment in A. Here it is sufficient to note that the final model of the accuracy of the speller in the presence of related effects and using a weighted characterscoring function (16) is also a member of the family defined in (13), as was the case for the ordinary speller, although, naturally, the parameters and are quite different in the two cases (see table 1, rows (a) and (f)).
In principle, the probabilistic model of the accuracy of a Donchin speller controlled by the scoring function (16) developed in the previous section and in A gives us the opportunity of optimising the algorithm's parameters in such a way as to maximise performance. However, in practice, optimally identifying the 22 parameters of the algorithm (and of the corresponding model) from empirical data without incurring in overfitting problems would require very extensive datasets, which are rarely available in BCI. Below we describe two workarounds for this problem and the reasons for our particular choices.
In all the data available to us, it is clear that the effect of on the P300 saturates for large enough values of (e.g., see figures 2 and 3). So, our first step to reduce the number of degrees of freedom of the algorithm was to clip the values returned by the function to an upper value of , thereby reducing the domain values of the function in (16), and, thus, its degrees of freedom, to 10 from the original 21, without significantly affecting performance. Our second step was to adopt a particularly simple form for which has only three degrees of freedom, thereby making a reliable model identification possible even from relatively small datasets. More specifically, we chose to be a linear splinedefined by the knots (abscissas) and the corresponding ordinate values . With these choices, optimising (16) requires finding appropriate values for a parameter vector .
We chose this particular form for for two further reasons. Firstly, our representation of ensures that the standard Donchin speller is part of our search space. It is easy to see this by using the assignment , which corresponds to setting and all the weights in (16) to 1. In this case, (16) collapses down to (1) and the model of the accuracy of the speller developed in the previous section gives us the accuracy of an ordinary Donchin speller in the presence of related effects, which, however, are not exploited when scoring characters (a situation depicted in row (g) of table 1). So, if for whatever reason using a weighted scoring function is not a good choice for a particular BCI, an optimisation algorithm can easily discover this and fall back on the standard scoring method. The second reason for considering functions in the neighbourhood of the constant function is a technical one. As discussed in A.3, with this choice the estimation errors produced by our model due to simplifying assumptions are minimum. So, the optimal parameters for the algorithm obtained by via our analytic approach are more likely to be close to the real global optimum for the speller.
Having constructed theoretical models of the accuracy of the standard speller and our speller, it is easy to see whether, with an appropriate choice of the parameter vector , the use of (16) can improve performance w.r.t. using (1). All one has to do to study the relative performance of the two spellers is to evaluate the difference between the values taken by (13) when using the parameters in rows (f) and (g) of table 1, respectively.
For space limitations, here we cannot provide full details on such a comparison: even with the constraints imposed on the new algorithm in section 6.3, we still have a 5dimensional parameter space: the 4 components of plus the repetition number, . However, below we will look at performance differences in a representative subset of the parameter space.
The contour plots on the left of figure 7 show the difference between the accuracy (percentage of correctly spelled characters) obtained when using a weighted scoring function and the accuracy of the standard Donchin's speller for the two subjects in dataset II of the BCI Competition III. The figure represents the subset of parameter space where , and , which corresponds to ensuring that the average output for a balanced data set containing the same number of targets and nontargets is zero. With these simplifications we are left with only two degrees of freedom, and , and we can, therefore, visualise accuracy differences using a plot. Specifically, in figure 7(left, top and bottom) we studied how the performance of the two spellers differed as the parameters and took all possible values in the set .
Interestingly, we see that there are choices for and , where the use of a weighted scoring function worsens performance. However, an appropriate choice of and can produce significant performance improvements: up to approximately 4% for subject A and up to approximately 1% for subject B. Even bigger improvements can be obtained if one allows (as we will in section 7) the independent optimisation of the 4 components of and the repetition number, .
In section 7 we will indirectly corroborate the veracity of the performance predictions made in the previous section by our model by testing the weighted scoring method experimentally on two independent test sets. However, before we do that, we would like to more directly and independently validate the model of our second stage and confirm its predictions. In this section we will do this via Monte Carlo simulations.
To keep the effort of this expensive form of validation under control and for an easier visualisation of results, we again considered a subset of the 5dimensional parameter space of the system. Namely, we set , and (as we did in the previous section) and ran a Monte Carlo simulation for each possible pair .
The simulation was performed as follows. For each we generated 100 million simulated sequences of stimuli using the following algorithm:
Figure 7(right, top and bottom) shows the results of these Monte Carlo simulations as contour plots, side by side with the results estimated through our analytical model, for our two subjects. Let us analyse these results.

Firstly, we should note the striking similarity between the contour plots obtained from the model and those resulting from Monte Carlo simulations. Apart from minor numerical differences, for each subject the shape and number of the contour lines match very closely across plots. Secondly, for both subjects, Monte Carlo simulations confirm that by appropriately weighting the flash scores, i.e., using (16) instead of (1), we can improve the accuracy of the speller considerably. Thirdly, the peaks of the performance improvement in the plane predicted with the two methods are also extremely close. All of this strongly corroborates the validity of our models.^{1}
Having corroborated the model presented in section 6.2, in this section we will use it to optimise weighted scoring functions so as to improve the accuracy of the speller in two sets of experiments. Tests will be carried out with independent datasets which will not be used in the setting of any of the parameters of the algorithm.
In sections 6.4 and 6.5, for analysis and visualisation purposes, we restricted the exploration of the parameter space to only two degrees of freedom. However, since the model computes estimates of the accuracy of the speller very efficiently, we can evaluate many tentative parameter settings. Therefore, in this section, for each subject we will allow the exploration of the full 5dimensional parameter space defined by the parameter and the four components of .
For each subject in dataset II from BCI Competition III and for each value of , we determined the vector which resulted, according to the model, in the best accuracy for the speller. The parameters of the algorithm/model were determined only using the training data from dataset II. Then the optimal weighting function was tested on the test set of the BCI competition. To reiterate, this set was never used during any phase of the design of the system, nor for the training of the classifier or the weighting function. So, results with this test give a true indication of the generalisation capabilities of the approach.
Table 3 and figure 8 show the results of our tests, reporting the accuracy of the speller on the test set for the two subjects when using the optimised weighted scoring function. We also report the results for the case , i.e., when using the standard character scoring method, which corresponds to the original method in [29]. Let us analyse the results.
Subject  
Algorithm  





13  46  79  
A  
opt  
35.3  71.7  84.7  93.3  98.0 



36.3  67.3  82.7  89.7  
B  
opt  
58.0  80.7  92.0  96.0  96.7 



57.0  78.7  90.0  95.7 
For very small or very large values of the improvement provided by a weighted scoring function is small or even slightly negative. For short sequences (), this is probably ascribable to the fact that some of the approximations in the model become less accurate when is very small (see A.2). Note that this does not imply that a weighted scoring function would not help with small values of : it simply means that, for this particular case, the parameters the model suggest for the scoring algorithm are suboptimal. The weighting has also a small effect for long sequences, where the performance of both scoring techniques reaches 95% ( for A and for B) and, so, there is little room for improvement.
However, for intermediate values of , the new method performs markedly better than the method in [29], i.e., the best performing Donchin speller published to date. More specifically, in the range of values where the accuracy of the speller is between 70% and 95%, there is an improvement of around 3% on average. This is particularly important because 7095% is the accuracy range for which the usability of a speller is highest. A speller with an accuracy below 70% can be frustrating for the user [38] and can prevent recipients from understanding messages. Instead, an accuracy exceeding 95% only marginally improves the understandability of messages while markedly increasing the time needed to communicate, which can be equally frustrating.
In order to confirm the applicability and benefits of the new approach on a larger and independent group of subjects, we tested the algorithm within our BCI lab on 12 further subjects. In this section we report the details and the results of these experiments.
Each participant was asked to spell a total of 20 characters. Target characters were randomly chosen before the beginning of the session. Each row and column of the standard 66 matrix of characters was randomly intensified without replacement for 100 ms with a gap of 50 ms, leading to a SOA of 150 ms. During the spelling of a character, each row and column flashed 10 times, for a total of 120 flashes, between targets and nontargets. During that period, subjects were asked to focus on the target character and to mentally count the number of times it was highlighted.
Subjects were seated comfortably with their neck supported by a Cshaped inflatable travel pillow to reduce muscular artefacts. The eyes were at approximately 80 cm from a 22'' LCD screen with 60 Hz refresh rate. Data were collected from 64 electrode sites using a BioSemi ActiveTwo EEG system. The EEG channels were referenced to the mean of the electrodes placed on either earlobe. Data were sampled at 2048 Hz, filtered and then downsampled by a factor of 8.
Cross validation was used to assess the accuracy of the new algorithm as well as of the standard, nonweighted version of the speller. More specifically, the dataset of each subject was split into 10 subsets, each including the data related to two target characters. One dataset was then used as a test set while the remaining nine formed the training set. Using the software in [39], one ensemble classifier was built by training one SVM on each of the 9 subsets of the training set. Then, using the same leaveone out approach as in section 5.1, we found the parameters , , and for the model in (3) and (14). Given these parameters, we used the analytical model in section 6.2, to find the optimal vector of parameters . We then used the ensemble classifier and the weighted scoring function associated to to classify the test subset. For reference, we did the same with the nonweighted scoring function. This procedure was repeated using, in turn, each one of the 10 subsets as test set and the remaining nine as training set. The results were then combined so as to obtain an average accuracy measure across the 20 characters acquired with each subject.
Subject  
Algorithm  





1  2  3  4  5  6  7  8  
1  
opt  
45  85  100  95  100  100  100  100  100  100 



50  85  90  95  100  100  100  100  100  
2  
opt  
80  100  95  100  100  100  100  100  100  100 



70  100  95  100  100  100  100  100  100  
3  
opt  
40  70  90  95  95  95  100  95  95  95 



45  65  85  95  90  95  95  95  95  
4  
opt  
55  95  95  100  95  100  100  100  100  100 



60  85  95  100  95  100  100  100  100  
5  
opt  
65  60  85  80  90  90  100  100  100  95 



55  60  85  85  95  90  100  100  100  
6  
opt  
70  90  100  100  100  100  100  100  100  100 



70  80  95  100  100  100  100  100  100  
7  
opt  
10  20  40  65  85  85  80  85  90  95 



15  15  40  60  75  70  80  90  85  
8  
opt  
55  90  95  95  100  95  100  100  95  100 



55  90  95  95  100  95  100  100  95  
9  
opt  
65  90  95  100  100  100  100  100  100  100 



70  80  95  100  100  100  100  100  100  
10  
opt  
35  65  90  90  95  95  95  95  100  100 



35  60  90  90  95  95  95  95  100  
11  
opt  
45  65  80  90  95  95  100  100  100  100 



45  65  75  90  95  95  100  100  100  
12  
opt  
35  75  90  95  95  95  100  100  100  100 



45  80  85  95  95  95  95  100  100 
Table 4 reports the results of this procedure for the 12 subjects and for 10 different values of the number of repetitions, . In total there are 120 accuracy figures in the table. These are multiples of 5% because each subject was tested on 20 characters which tends to make small performance differences difficult to resolve. Nonetheless, our new algorithm performs better than the standard speller in 21 cases and when it does so the accuracy improvement is approximately 7.4%, i.e., between one and two characters more out of 20. Conversely, only in 10 cases the standard speller beats our algorithm and when it does so the performance difference is 5.5%, corresponding to approximately one character. Additionally, 6 out of these 10 cases are for , where we already know from the previous section that our model becomes less reliable and so our algorithm is likely to be suboptimally parametrised. So, overall results are quite encouraging.
As the accuracy figures for different values of are strongly correlated, we could not run tests to verify the statistical significance of the data in table 4 using more than one value of per subject. However, it seemed inappropriate to choose a particular value of , since different algorithms may work best for different 's and also the optimal value of (in terms of usability  more on this below) appears to be subject dependent. Therefore, for statistical testing we decided to take, for each subject, the value of for which either method reached at least an accuracy of 80% which provides a reasonable tradeoff between accuracy and speed (see discussion at the end of section 7.1). Table 5 reports the value chosen with this method and the corresponding accuracy of the two algorithms, for each subject. Note that the data in this table are a subset of the data in table 4.
We ran a paired ttest on the data from table 5. The test confirmed that the accuracy improvement of the weighted scoring algorithm w.r.t. the nonweighted one is statistically significant (=0.014, =2.93) with a 95%confidence interval for the improvement of 1.148.03%. In other words, we obtained significant improvements in accuracy with our own subjects as well. They are compatible with the values estimated by our probabilistic model, the Monte Carlo simulations and with the values obtained on the BCI competition test set.
All this evidence points clearly in one direction: significant benefits can be accrued by considering and exploiting the modulations in P300s induced by stimulus history.



Subject  



1  2  3  4  5  6  7  8  9  10  11  12  
such that acc 80%  
2  1  3  2  3  2  5  2  2  3  3  3  
Algorithm  opt  
85  80  90  95  85  90  85  90  90  90  80  90  


85  70  85  85  85  80  75  90  80  90  75  85 
Amplitude and shape variations in brain waves are often considered harmful for BCIs. However, as we suggested in [3,4], in some cases they carry information which can be exploited to improve performance.
Our system embraces this philosophy, by considering, modelling and exploiting TTIrelated P300 modulations. This is done at two levels. In system's second stage (the character scorer), we use different weighting factors for the responses of the first stage to flashes. The weights are based on flash latencies with respect to some earlier reference event, thereby making it possible for the system to adapt to or exploit history effects on brain waves. We also use a model which takes TTIrelated P300 modulations into account to optimally set the weights and biases of such a system from a training set of data. The model explicitly represents the modulations in the flash scores produced by the first stage in the presence of TTI variations, translates them into corresponding character scores and integrates the results into a formula which gives the probability of correct classification as a function of the system's parameters.
The good results obtained with this approach suggest that the modulations in P300 related to differences in TTI are not necessarily a problem within the Donchin speller paradigm, provided one takes such variations into account when designing the system, as we did, instead of averaging over them.
Naturally, this is not the only way of improving performance in a Donchin speller. Another alternative is, for example, to modify the protocol in such a way to reduce TTIrelated modulations in ERPs. For instance, [40] suggested that performance improvements could be obtained by using stimuli where the letters being flashed were not all in the same row or column. Groups of flashing letters were chosen in such a way as to maximise the distance between the representation of the outputs of the first stage (over multiple repetitions of a complete sequence of flashes) for different characters on the assumption that the firststage classifier was 100% correct. This particular choice provides the system with error correction capabilities which may help in the recognition of characters. The benefits of this idea were particularly clear when using a new type of stimuli where each letter of the matrix was placed in a grey rectangle with either horizontal or vertical orientation and letters were activated by switching the orientation of the rectangle between these two states. It was suggested that these stimuli may generate psychophysiological responses which are less susceptible to TTI modulations than the original flashes. It would be interesting to combine such stimuli with our method to see if the two techniques can work synergistically to further improve performance.
An important question regarding performance is whether this is assessed with online or offline data and analyses. It is clear that the ultimate goal should be to evaluate performance by testing a system online, i.e., in exactly the same conditions experienced by the intended users of the system. However, after weighing the benefits and the obstacles associated with testing our system online we decided to postpone an online comparison to a followup paper and perform an offline analysis at this stage. The main reason is that with an offline analysis we can perform a comparison of the two algorithms on the exact same datasets, which in turn allows the use of a paired ttest to establish the statistical significance or otherwise of performance differences. This test has a greater statistical power than an unpaired test, thereby making it possible to make statistically sound inferences with much fewer subjects. Testing the two algorithms with online systems would have added a noise factor due to the two algorithms classifying different EEG signals and would have prevented us from using of a paired test.
Recent work [41,42] has shown that the performance of P300based spellers is significantly affected by whether users gaze at target letters in addition to focusing their attention on them (overt attention) or gaze at a fixation element of the screen and use covert attention to focus on target letters. Performance is significantly higher in the overt attention case. This is most likely because, in addition to the P300 and other endogenous components being modulated, also visual exogenous ERPs are modulated by overt attention (contrarily to what happens with covert attention). As a result, in the case of overt attention classifiers can exploit extra information when making decisions. As suggested in [42], this implies that resulting BCIs are not truly independent and, thus, future work should attempt to create systems which decouple attention from gaze. In such systems it will be even more important to use strategies such as the methods suggested in this paper to make the best use of the information available in endogenous ERPs, including their modulations due to stimulus history.
Finally, we would like to mention, that in recent work BCIs that rely on P300s evoked through different sensory modalities such as spacial auditory [43] or tactile [44] have been proposed. There is no reason to believe the techniques presented in the paper to best exploit visual ERPs could not be applied also to ERPs generated using stimuli presented in a different modality.
In this paper we first documented, modelled and then exploited, within the context of the Donchin speller, a modulation in the amplitude of P300s associated with variations in the number of nontargets preceding a target. In particular, we specialised the system through the use of a new characterscoring function which, with minimal effort, allows it take such modulations of the P300 account.
We mathematically assessed the potential of the approach by means of a statistical model based on few reasonable assumptions, which we later verified. Validation by Monte Carlo simulations showed that the model is accurate.
Testing the new approach on unseen data revealed an average improvement in accuracy of between 2 and 4% over the best classification algorithm for Donchin speller in the literature [29] at the range of sequence lengths which is most suited for practical use of a speller.
The model and method we developed are quite general and can be applied to a wide spectrum of ERPbased BCIs, including the Donchin speller with different matrix sizes, a sequential speller, a generic multichoice selection BCI, and many others. In future research we intend to explore the possibility of obtaining similar improvements within other BCI paradigms based on P300s, including our own BCI Mouse [4].
The authors would like to thank Mathew Salvaris for his valuable comments and help. We would also like to thank Franciso Sepulveda for his help in revising the manuscript.
This work was supported by the Engineering and Physical Sciences Research Council (UK) under grant EP/F033818/1 ``Analogue Evolutionary Brain Computer Interfaces''.
In our model of the first stage of Donchin speller in the presence of history effects, we assumed that the distribution of flash scores follows (3) for not targets (see section 4.2.1) and follows (14) for targets (see section 5.1), respectively. In this appendix we will formally derive a model of the second stage so as to assess the accuracy of the speller when making decisions after a sequence of stimulus repetitions using the scoring function in (16).
Let us start by considering a single repetition of 12 flashes  one for each row and column  within a longer sequence of repetitions (we will extend the treatment to the multirepetition case later). Then (16) can be rewritten in statistical terms as
If the target character is at position , according to (3) and (14) the variables can be expressed as:
To avoid confusion, in the following, every time two independently instantiated variables appear in the same expression, we will use different Roman superscripts to highlight the fact that they are variables with the same distribution, but not the same random variable.
Because of the variability of , , , , and , the functions and are stochastic. For notional convenience we represent them with a new stochastic variable, , which we need to characterise. To simplify our treatment, we make use of the assumption that the number of flashes occurring between the flashing of two particular rows or columns of interest is independently and identically distributed across epochs. This is a first order approximation because it ignores the constraint that each row and column flash exactly once within each group of 12 flashes. Under this assumption, however, the variables associated with different , , , become i.i.d.
Using , (19) simplifies. However it takes different forms depending on whether we are scoring a target character, a ``halftarget'' (i.e., a character belonging to either the same row or the same column of the target, but not both) or a nontarget. Namely:
The target character will be correctly identified if and only if the stochastic variable takes a value that is greater than the maximum taken by the variables in the set . We want to compute the probability of this happening.
This task is quite difficult. In order to simplify the treatment we will make an approximation which leads to slightly overestimating the accuracy of the speller. Namely, we will assume that misclassifications can only occur because the score of one of the 10 most likely competitors, the ``halftargets'', is higher than the value taken by . In other words, we neglect the possibility of one of the 25 nontargets (corresponding to (23)) causing a misclassification.^{3} We will therefore drop (23) hereafter.
Equations (20)(22) can be rewritten as
When a sequence of repetitions is used to score characters and make decisions, the average, , of instantiations of has to be compared with the corresponding halftarget averages and . These can be expressed by trivially generalising as follows:
The purpose of in our scoring equation is to slightly modulate the effect different values of have on the average, according to their . So we will choose weights that are distributed around 1. Under this restriction, it is safe to assume that the value such that also satisfies . In other words, in (33) we allow the second term to partly contribute to the value of the score but we assume that its value is never so big to change which one of the nontarget rows has the maximum score.
Under this assumption
(34) 
Combining these results we obtain:
We now need to find the distributions of , , , and . In the following we will concentrate on the variables but results also apply to the variables because their distributions are the same.
The variables and depend on , , and . These, in turn, depend on and , which are averages of and variables. So, let us start by finding the distributions of and .
Using the theorem in B, we can express the cumulative distribution functions (cdf) of and as:
Using the theorem in C, from (36) and (37) we obtain:
We now want to find the distributions of the averages of instances of and , namely and , respectively. We will do so by using the central limit theorem.
According to the central limit theorem, the distribution of the average of a sufficiently large number of i.i.d. random variables can be considered as normal and its mean and variance can be estimated from the mean and variance of the distribution of the original variables. In our case the number of terms averaged ranges from a few to up to 15. These numbers may look too small for the application of the central limit theorem. However, results on the rate of convergence of the central limit theorem based on pseudomoments (e.g., [45]) indicate that a faster convergence can be obtained when the original distribution is close to a normal distribution, which is the case in our model.
From (38) and (39), it follows that the probability density functions of and are sums of scaled and shifted Gaussians. If the differences among the values of as well as the differences among the values of are small compared to and the values of and are of the same order of magnitude as , the Gaussians in (38) and (39) are close and merge into a belllike distribution which closely resembles a Gaussian. Hence, the central limit theorem gives reasonably accurate approximations even for .
Using the central limit theorem, and are normally distributed with parameters: , , , .
Through a procedure similar to that in B, we find that the variable has the following cdf:
The considerations made earlier about the central limit theorem apply also to . Thus, we find that is approximately Gaussian with parameters and .
As and are independent and normally distributed, then also is normally distributed. Its parameters are: and .
For the non targets, the distribution of can be found using the same procedure as in section 4.2 obtaining .
We can find the probability that using the same procedure as in section 4.2 obtaining the equivalent of (11).
The very last step is to derive the probability of the conjunction on the right in (35): . Even if the statement (35) is formally identical to (9) with replacing and replacing , we cannot directly write the equivalent of (10) because and are not independent. In fact, they both depend on both and . Their dependence is through and whose means and (as shown in A.2) contain the factor . However, if we restrict the search space to values of and (which affects ) such that is never too different from 1 and is not excessively large, we can assume the independence of and .
Under this assumption we can approximate as obtaining:
Therefore, can be expressed as
The independence of and , and the fact that is discrete, imply that
Using Iverson brackets (see A.1) we can rewrite (41) as
The outer integral performs the convolution between the Dirac function and the inner integral. As
Each corresponds to a function . As is strictly increasing, its inverse exists and is increasing, too. Therefore,
From the definition of variance, we have
In section 5.3 we found that the average flash score for ERPs in partition nt01 (which gathers the responses to the second nontarget following a target) was lower than for other nontarget partitions. In this appendix we look at possible causes of TTIrelated modulations of flash scores in the presence of nontargets and how the theory and methods developed in the paper would be affected by the use of (15).
When stimuli are presented at a sufficiently fast pace, the P300 produced in response to a target stimulus will appear within the epochs corresponding to the presentation of one or more nontargets following that target thereby deforming ERPs. Naturally, not all such deformations will influence a flash scorer. However, some may do so.
The SVM classifier used in the first stage of our Donchin speller weighs the contribution of the different EEG samples in order to maximise the distance between targets and nontargets. For simplicity let us think of it as a kind of matched filter looking for P300like shapes, i.e., positive deflections of the EEG w.r.t. the baseline in a time window of around 300500 ms after stimulus presentation. With the standard stimulus presentation rate of Donchin's speller, for the epochs in the nt01 partition, the P300 induced by the preceding target (the target epoch that started before an nt01 epoch) shifts positively the baseline of the nt01 epochs thereby significantly deforming the ERP and making it less likely to produce a high flash score. Figure 9 illustrates the idea using a simple synthetic signal and an elementary classifier. We believe that this is the most likely reason for nt01 having more negative scores than other nontarget partitions in our tests.
Of course, when using a different timing for stimuli, one might find that other nontarget partitions are affected by this effect. The effect may also depend on the particular choice of stimulus paradigm and flash scoring algorithm. So, we cannot exclude that in different P300 based BCIs a more marked TTI modulation of nontarget ERPs might be observed. For these reasons it is interesting to consider how the theory and algorithms presented in this paper would be modified if we wanted to also model and exploit such modulations.
Building a model where the score distribution of the nontargets depends on would be conceptually straightforward and would pose no new challenges. The model for nontargets could be expressed using (15). Apart from small formal changes in the formulae, the main substantial difference is that (21) and (22) should be rewritten as
The second term in (46) was called in A. Earlier its mean, , was found to be . When using (46), . But, as and are independent, this reduces to the product , where the second factor is the average nontarget score among the partitions, i.e., . In other words, even with a more sophisticated model of nontarget scores, . The same applies to the first term in (47). This shows that a model that assumes the dependence of the nontarget score distribution on would make the mathematical formulation more complicated without giving any substantial advantage.