**Specification and assessment of methods supporting the development of neural networks in medicine**

by M. Egmont-Petersen

Doctoral dissertation, Faculty of Medicine, Maastricht University, Shaker Publishing B.V., Maastricht, 1996. (Available for purchase from http://www.amazon.co.uk or from Shaker Publishing.)

**1 Scope of the dissertation**

This doctoral dissertation presents methods and techniques that may expedite the application of neural networks in medicine. Research on neural networks started as a branch of neurology. A neural network consists of a set of interconnected nodes (neurons). Each node works like a junction between "nerve" paths: the neuron receives a number of inputs and produces an activation, which is functionally dependent on the input signals the neuron receives. Each input signal is modified by a weight. Since their introduction by McCulloch and Pitts in 1943, neural networks have migrated to cognitive science, artificial intelligence, statistical regression and decision theory, signal processing and other engineering disciplines. Neural networks have been developed for a large number of applications in economics, computer science, telecommunications and medicine. Chapter 1 contains a brief overview of neural networks that have been developed for clinical decision support.
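As a minimal illustration (not taken from the dissertation), the node model described above can be sketched in a few lines, assuming a sigmoid activation function, which the dissertation does not fix:

```python
import math

def neuron_activation(inputs, weights, bias):
    """A single node: each input signal is modified by its weight, the
    weighted inputs are summed, and a sigmoid maps the net input to an
    activation between 0 and 1."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))
```

With zero net input the sigmoid yields an activation of 0.5; larger weighted sums push the activation toward 1, smaller ones toward 0.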

Clinical application of neural networks is problematic because of their black-box nature. It is very difficult to assess the knowledge encoded in the weights of a trained neural network, as it constitutes a nonlinear mapping between the feature (input) space and the class (output) space. The dissertation suggests different techniques that characterize the properties of a trained neural network, thereby expediting the development and verification/validation of neural networks. To enhance the applicability of neural networks, the topic of missing data is also addressed.

**2 A neural network performs a mapping**

The most general notion of a classifier is a mathematical mapping from an *n*-dimensional input space to a *c*-dimensional output space, N: R^*n* → R^*c*.

**3 Assessing the output of a classifier**

In chapter 2, metrics are
defined that characterize the performance of a trained neural network. The
performance is measured by letting the neural network classify a set of test
cases of which the true class label is known. A contingency table (confusion
matrix) is used to characterize the performance of a neural-net classifier.
Existing metrics that characterize different properties of the neural-net
classifier are discussed, and some new metrics are introduced. Although these
metrics are defined for a neural-net classifier, they can be applied to any
classifier whose results can be characterized by a contingency table.
The metrics include *correctness* - the fraction of correctly classified
cases - and *coverage* - the fraction of cases to which the neural network
can assign a class label. The misclassified cases are characterized by the
metrics for *bias* and *dispersion*. Standard errors and confidence
intervals for some of the metrics are specified. The usefulness of the metrics
is explored in a set of experiments in which neural networks are trained for
classification of thyroid disorders.
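A sketch of how the *correctness* and *coverage* metrics could be computed from classifier output. It assumes, as one plausible convention not spelled out here, that correctness is taken over the classified cases only, and uses an illustrative sentinel value for cases the classifier rejects:

```python
def correctness_and_coverage(true_labels, predicted_labels, reject=None):
    """Coverage: fraction of cases assigned a class label.
    Correctness: fraction of correctly classified cases among those
    that received a label. `reject` marks unclassified cases."""
    total = len(true_labels)
    classified = [(t, p) for t, p in zip(true_labels, predicted_labels)
                  if p != reject]
    coverage = len(classified) / total
    correct = sum(1 for t, p in classified if t == p)
    correctness = correct / len(classified) if classified else 0.0
    return correctness, coverage
```

For example, with four test cases of which one is rejected and two of the remaining three are correct, coverage is 0.75 and correctness 2/3.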

**4 Assessing the importance of attributes for a classifier**

Chapters 3 and 4 address how to assess the contribution of individual attributes to the performance of a neural-net classifier. The motivation for performing attribute assessment is twofold. First, one wants to obtain insight into which attributes are important for assigning a correct class label to one or more cases. Second, one needs a criterion to rank the attributes according to their contribution to the performance of the neural-net classifier before unimportant attributes can be pruned.

In chapter 3, different
approaches to attribute selection such as forward, backward and
Branch-and-Bound search are discussed. It is argued that backward search is a
suitable selection strategy. Based on a mathematical analysis of a minimal
error-rate classifier, a metric for the *discriminative power* of an attribute
is introduced. This metric is used as a criterion to rank the attributes for
each case in the test set. The ranks for each attribute are summed over all
cases. The summed ranks are compared using Friedman's two-way analysis of
variance. Attributes with a high average rank are unimportant for the neural
network whereas attributes with a low average rank have the most influence on
the classification performance. The usefulness of this approach is assessed in
a number of experiments with artificial classification problems. The
experiments indicate that the approach ranks the attributes correctly, both
for classifiers trained with independent attributes and for classifiers
trained with dependent attributes. The approach is also used in an
application to identify attributes that are important for discriminating four
different types of texture in radiographs of focal bone lesions.
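The rank-summation step described above can be sketched as follows. The chi-square formula is the standard Friedman statistic for comparing summed ranks, which may differ in detail from the dissertation's exact computation:

```python
def friedman_statistic(rank_rows):
    """Sum per-case attribute ranks and compute the Friedman chi-square.
    Each row holds the ranks (1..k) of the k attributes for one test case;
    rank 1 = most discriminative for that case."""
    n = len(rank_rows)        # number of test cases
    k = len(rank_rows[0])     # number of attributes
    summed = [sum(row[j] for row in rank_rows) for j in range(k)]
    chi2 = (12.0 / (n * k * (k + 1))) * sum(r * r for r in summed) \
           - 3.0 * n * (k + 1)
    return summed, chi2
```

When all cases rank the attributes identically, the statistic reaches its maximum n(k−1); a large value indicates that the attributes differ systematically in discriminative power.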

In chapter 4, a
mathematical framework is developed in which four different feature measures
are derived from a minimal error-rate classifier. Each measure allows one to
compute a lower bound for the *marginal contribution* of a feature to the
performance of a statistical classifier. These measures characterize the *influence*
and *replaceability* of a feature. Influence is the probability that a
feature could change the class label of a case while the other feature
values are kept fixed. Replaceability is the expected decrease in performance
when a feature value is substituted by the conditional mean of the feature.

Each feature measure is
made operational by a feature metric. Computation of three of the four metrics
requires the identification of the attribute-conditional decision boundaries.
The decision boundaries for a given feature depend on the values of the other *n*-1
features and have to be identified in each case. The boundaries are identified
with a piecewise polynomial approximation which is based on a Taylor expansion
of the output of a neural-net classifier as a function of the given feature.

A pruning method called *LMS-pruning*
is introduced. A feature is LMS-pruned by removing the links that connect the
input node of the feature with the hidden nodes and changing the weights that
connect the remaining features with the hidden nodes. The weights are modified
such that the pruned neural network classifies the training cases identically
to a network based on *n* feature values with the value of the pruned
feature replaced by its expected value.
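Under the simplifying assumption that the pruned feature is replaced by its *unconditional* mean, the required weight change even has a closed form: the expected contribution of the pruned input can be folded into the hidden-node biases. This is only a sketch of the idea behind LMS-pruning; the dissertation's method fits the remaining weights by least squares and the replacement value is conditional on the other features:

```python
import numpy as np

def prune_input_feature(W, b, k, mean_k):
    """Prune input feature k from a hidden layer (weights W, biases b)
    by absorbing its expected contribution W[:, k] * mean_k into the
    biases, then deleting the links from input node k."""
    b_new = b + W[:, k] * mean_k      # fold E[x_k] into the hidden biases
    W_new = np.delete(W, k, axis=1)   # remove links from the pruned input
    return W_new, b_new
```

The pruned network then produces exactly the same hidden-node inputs as the original network evaluated with the pruned feature fixed at its mean.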

In experiments with artificial classification tasks, the four metrics are compared with respect to their ability to rank the features. These experiments indicate that replaceability is the best ranking criterion. The experiments showed that for neural-net classifiers with a performance close to the minimal error rate, LMS-pruning a feature resulted in a pruned network with a performance that remains close to the maximal (Bayesian) correctness.

**5 Estimation of missing data**

In chapter 5, a method for iterative estimation of missing data is suggested. Statistical classifiers such as neural networks require all inputs to be present before a class label can be assigned to a case. This impedes the application of such classifiers in environments where incomplete data occur frequently. Different approaches to estimating missing data, such as the EM-algorithm and multiple imputation, are discussed. To cope with some drawbacks of these two methods, it is suggested to use an auto-associator neural network in recurrent mode to estimate missing values. The properties of an auto-associator that is trained with complete cases are analyzed. The conditions that ensure convergence of the recurrent auto-associator are derived. It is proven that convergence is only possible when the number of hidden nodes of the auto-associator is smaller than or equal to the number of observed values in an incomplete case.
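The recurrent estimation scheme can be sketched as follows, with a hypothetical `reconstruct` function standing in for the forward pass of a trained auto-associator; the clamping-and-iterating loop is the general idea, not the dissertation's exact algorithm:

```python
import numpy as np

def impute_recurrent(x, missing, reconstruct, n_iter=100, tol=1e-6):
    """Estimate missing entries of a case by running an auto-associator
    in recurrent mode: reconstruct the case, copy the reconstruction into
    the missing positions, keep the observed values clamped, iterate."""
    x = np.array(x, dtype=float)
    x[missing] = 0.0                  # neutral starting guess
    for _ in range(n_iter):
        y = reconstruct(x)            # one forward pass through the net
        delta = np.max(np.abs(y[missing] - x[missing]))
        x[missing] = y[missing]       # observed values stay clamped
        if delta < tol:
            break
    return x
```

As a toy check, a linear "auto-associator" that projects onto the direction (1, 1) has effectively one hidden node; with one observed value the convergence condition above is met, and the iteration converges to the fixed point in which the missing value equals the observed one.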

The recurrent auto-associator is embedded in the Recurrent Expectation Maximization (REM) algorithm, an iterative approach for estimating missing values in a set of cases. In a set of experiments, the residual variance of predictions made by the recurrent auto-associator is compared with the residual variance obtained using multivariate linear regression. The REM and EM algorithms are also compared with respect to their ability to estimate missing values (residual variance) and to estimate the covariance matrix from the incomplete sample. The experiments indicate that the recurrent auto-associator results in poorer estimates of the missing values than multivariate regression. The REM-algorithm estimates the covariance matrix slightly worse than the EM-algorithm when the data are fairly correlated and all variables have identical variances. However, the REM-algorithm gives an indication of those combinations of variables with missing and observed values for which the missing data will be predicted poorly. Leaving out such cases leads to an improvement in the estimation of the covariance matrices by the REM-algorithm.

**6 Classification from noisy attributes**

In chapter 6, the
influence of measurement noise on the classification of a case is analyzed.
Based on ideas of Brender *et al.*, a quality measure called robustness is
specified. The robustness of a classification is the probability that the class
label assigned to the case would not be different from the classification based
on the (unknown) true attribute values. It is assumed that the measurement
noise is Gaussian with zero mean and uncorrelated with the attributes. A
formula for the robustness of a classification is specified.
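As an illustration (not the dissertation's formula), robustness can be approximated by Monte Carlo sampling: perturb the measured attribute values with the assumed Gaussian noise and count how often the assigned class label survives. Sampling around the measured values rather than the unknown true values is a simplification:

```python
import random

def estimate_robustness(classify, x, sigma, n_samples=10000, seed=0):
    """Monte Carlo estimate of robustness: the fraction of zero-mean
    Gaussian perturbations of the measured attributes x (noise sd per
    attribute in sigma) that leave the assigned class label unchanged."""
    rng = random.Random(seed)
    label = classify(x)
    unchanged = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.gauss(0.0, s) for xi, s in zip(x, sigma)]
        if classify(perturbed) == label:
            unchanged += 1
    return unchanged / n_samples
```

A case far from the decision boundary relative to the noise level yields a robustness near 1; a case sitting on the boundary yields a robustness near 0.5 for a two-class problem.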

In practice, it is difficult to estimate the robustness of a classification when the probability density function of the uncontaminated attributes is unknown. Therefore, two approximations are suggested. The bias introduced by these two approximations is analyzed for the special situations where an attribute comes either from a unimodal or a bimodal distribution and is to be classified into one of two classes.

A simulation experiment
illustrates how often an attribute has to be remeasured to achieve a robust
classification (the measurements are averaged, which reduces the influence of
the measurement noise on the attribute value). It is clear that remeasuring a
(noisy) attribute makes sense when only a few remeasurements are required to
ensure a classification with a sufficiently high robustness. When, however, the
robustness of a classification becomes too low, the number of measurements that
are necessary to obtain a more accurate estimate of the attribute values
becomes very high. The notion of *remeasuring intervals* is introduced.
Such intervals indicate when remeasuring an attribute makes sense.
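For a single attribute and a threshold classifier, the effect of averaging m remeasurements can be sketched directly: averaging shrinks the noise standard deviation by a factor √m, so the robustness of the decision grows with m. The functions and the 0.95 target below are illustrative assumptions, not the dissertation's definitions:

```python
import math

def robustness_after_m(measured, threshold, sigma, m):
    """Robustness of a one-attribute threshold decision when the attribute
    is the average of m measurements with noise sd sigma: the averaged
    noise has sd sigma / sqrt(m), and robustness is Phi(z) for the
    standardized distance z to the threshold."""
    z = abs(measured - threshold) / (sigma / math.sqrt(m))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Gaussian CDF

def measurements_needed(measured, threshold, sigma, target=0.95, m_max=1000):
    """Smallest m reaching the target robustness, or None if m_max is
    exceeded -- the regime where remeasuring stops being worthwhile."""
    for m in range(1, m_max + 1):
        if robustness_after_m(measured, threshold, sigma, m) >= target:
            return m
    return None
```

A value half a noise standard deviation from the threshold needs on the order of ten remeasurements to reach 95% robustness, while a value exactly on the threshold can never be remeasured into a robust classification, matching the observation above that the required number of measurements explodes as robustness drops.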

**7 General conclusion**

The methods and
techniques developed in this dissertation are explored in a set of experiments.
In chapter 7, it is discussed to what extent these methods and techniques may
support development, verification and validation of neural networks. The
possibility of applying knowledge-based systems in general and neural networks
in particular in the clinic is discussed as well. It is argued that
introduction of such systems in clinical practice interferes directly with the
work processes of physicians. One can expect that such systems will have their
largest potential in low-level information processing. It is an issue for
further research to investigate the value of the presented methods and
techniques in the development and evaluation of neural networks for clinical
application.