Capturing stochastic variations of train event times and
           process times with goodness-of-fit tests
                                         Jianxin Yuan
          Department of Transport and Planning, Delft University of Technology
        P.O. Box 5048, 2600 GA Delft, The Netherlands, e-mail: j.yuan@tudelft.nl

Abstract
In this paper, several statistical distributions selected for capturing stochastic variations of
train event times and process times are assessed based on real-world data recorded at the
Dutch railway station The Hague Holland Spoor. The assessment is performed using the
Kolmogorov-Smirnov goodness-of-fit test. To achieve a precise assessment of those
candidate distributions, a new approach to fine-tuning the distribution parameters is
proposed. The distributions are compared not only for the arrival and departure times of
trains at the station, but also for the arrival times of trains at the boundaries of the local
railway network and the train running, dwell and track occupancy times in the network.
Furthermore, the selected distributions are assessed for the conditional running and dwell
times as well as track occupancy times of trains in case of (no) hindrance caused by other
trains. The analysis results reveal that the stochastic variations of train event times and
process times can be well captured by either the lognormal distribution or the Weibull
distribution. These fitted statistical distributions can be used as input models in the
analysis of timetable robustness and prediction of the punctuality of train operations.

Keywords
Train delays, Running & dwell times, Statistical distributions, Parameter estimation,
Goodness-of-fit

1   Introduction
To analyze the robustness of timetables and predict the punctuality of train operations in a
railway network, the distributions of initial delays at the network boundaries and the
distributions of additional delays within the network are often assumed. Only a few
studies [7], [13], [11], [4], [5] provide a statistical analysis of the distributions of train
delays based on real-world data. It is doubtful whether an analytical or simulation model
could accurately predict the propagation of train delays and the punctuality of trains
without a realistic description of stochastic variations of the initial and additional delays
[6], [12].
    The initial delay of a train is the difference between the actual event time and the
scheduled time at a network boundary, while an additional delay within the network is the
difference between the actual and scheduled time periods for completing a running or
dwell process. This paper aims to capture the stochastic variations of train event times and
process times using real-world traffic data, i.e. to obtain the distributions of initial and
additional delays statistically.
    For the analysis and prediction of timetable robustness and train punctuality, it is
important to make a distinction between primary delays and knock-on delays occurring in
an operational process. The primary delays of trains may be due to technical failures,
running at a speed lower than scheduled, prolonged alighting and boarding times of
passengers, and bad weather conditions, while the knock-on delays of trains result from
tight headway, route conflicts and late transfer connections at stations. For the primary
delays in train running or dwell processes, it appears that the distributions cannot be
derived directly from track occupation and release records, which can only show the total
delays and may include knock-on delays. Therefore, data filtering is necessary to model
the distribution of primary delays.
    To accurately estimate the knock-on delays of trains suffered before occupying a
certain track junction or station platform, the part of delays due to deceleration and
acceleration in the case of tight headway and route conflicts must be taken into account.
Therefore, the conditional distributions of train running and track occupancy times in case
of hindrance must also be studied.
    To capture the variability of train event times and process times, several statistical
distributions are first selected and then assessed with goodness-of-fit tests. Following this
brief introduction, the collection of data is outlined. Next, we discuss the
selection of appropriate statistical distributions and introduce the distribution
assessment approach, i.e. the goodness-of-fit test. Since a hypothesized
distribution with grossly estimated parameters can be easily rejected by a goodness-of-fit
test especially in case of a large sample size, a new approach to fine-tuning the parameters
of hypothesized distributions is proposed. Furthermore, the distribution fitting results are
shown. The final section of the paper covers the main conclusions.

2   Obtaining data
This statistical analysis is based on the track occupancy and release times of 10,000 trains
which were recorded at the Dutch railway station The Hague Holland Spoor (The Hague
HS). In the Dutch railway system, the train describers, i.e. TNV systems
(TreinNummerVolgsystemen), keep track of the progress of trains at discrete steps over their
routes. The TNV logfiles, about 25 MB of ASCII text per day per TNV system, contain
chronological information about all signalling and interlocking events in a traffic control
area. The TNV logfiles give an accurate description of train movements with a maximal
error of 1s, but track section messages are not matched to individual train numbers.
    By posterior analysis of the TNV logfiles, Goverde and Hansen [3] developed the data
mining tool TNV-Prepare that couples events of infrastructural elements to train numbers.
The TNV-Prepare output consists of chronological TNV-tables per train line service and
(sub)route. For each individual train, event times along the route are given, including train
description steps, section entries and clearances, signals, and point switches. From this
information other interesting process times such as running times, blocking times, and
headways can be derived easily. Because the actual arrival and departure times of a train
at a platform stop are not recorded in the TNV records, the tool TNV-Filter [4] was
developed which estimates these event times based on data generated by TNV-Prepare.
The arrival and departure delays as well as the dwell time of each individual train are then
calculated with a precision in the order of a second.
    During the time period of data recording, 24 passenger trains arrived and departed at
The Hague HS per hour corresponding to 9 different train series in both the southbound
and northbound directions. The types and routes of trains may affect the distributions of
train event times and process times. The assessment of those candidate distributions is to
be performed per train series in each direction. A train series in one direction is hereafter
referred to as a studied case.

3   Selecting statistical distributions and testing the goodness-of-fit
3.1 Select statistical distributions

An appropriate distribution can be selected based on the physical properties of a random
variable. However, it is often difficult to know all the properties when the variable
represents a result of a great number of complicated processes influenced by human
behaviours. The statistical distribution of a random variable can also be selected by
inspecting the data. In this case, a histogram is a useful representation since it shows the
distribution shape of the data. It is best to use both these approaches to ensure that they are
in agreement [10].
    Train event times and process times generally have a lower bound, i.e., the earliest or
shortest time. Furthermore, the histograms are often skewed to the right especially in the
case of train event times, which are subject to an accumulation of the previous delays. It
appears that the lognormal, gamma, and Weibull distributions, which are very flexible in
the fitting to the shape of the distribution of a data set, may capture the stochastic
variations of train event times and process times with a higher accuracy than other
distributions. A symmetrical distribution without a lower bound such as the normal
distribution cannot be a generic candidate distribution, although this sort of distribution
might approximate the variability reasonably well in some situations.
    The histograms of train arrival and departure times at stations are subject to the
alteration of train orders in real-time operations and thus often have a heavy tail on the
right. It seems that a mixture distribution may better represent the variability since this
sort of distribution can incorporate the heavy tail. Güttler [5] fitted a normal-lognormal
mixture distribution to excess running times of trains between two stations
using the data obtained from the German Railway. Note that finite mixtures will introduce
additional complexity because they have more parameters, whose estimation usually
requires solving a complex system of non-linear equations. Moreover, train planners
generally prefer to use a simpler distribution model if it can approximate the reality
reasonably well. Thus, fitting a mixture distribution to the observed event times and
process times will not be studied. Instead, we focus on the conditional
distributions of the running times of trains in case of free and hindered running as well
as the conditional distribution of the dwell times of trains in the absence of hindrance. The
aim is to provide input distribution models for predicting knock-on delays and their impact
on train punctuality in railway stations and networks [14].
    Based on the analysis above, the lognormal, gamma, and Weibull distributions are
mainly selected for capturing the stochastic variations of train event times and process
times. The exponential distribution, which is widely used to approximate the variability of
non-negative delays, is implicitly included since it is a special case of the gamma and
Weibull distributions. The normal, uniform, and beta distributions have been applied in
some studies [1], [4], and we also consider them as candidate distributions for
modelling the variability of train event times and process times.

3.2 Test the goodness-of-fit

For assessing the goodness-of-fit of statistical distributions, both graphical approaches and
goodness-of-fit tests can be applied. Law and Kelton [8] presented five main graphical
approaches for comparing fitted distributions with the true underlying distribution. These
approaches include density/histogram overplot, frequency comparison, distribution
function differences plot, probability-probability plot and quantile-quantile plot.
    Graphical approaches give a good guide as to the goodness-of-fit and enable us to
weed out poor candidates. Following on from a graphical analysis it is important to
perform statistical tests to determine how closely a distribution fits the empirical data. The
oldest goodness-of-fit test is the chi-square test, which can be thought of as a more formal
comparison of a histogram with the fitted density function. Another commonly used
goodness-of-fit test is the Kolmogorov-Smirnov (K-S) test, which compares an empirical
distribution function with the hypothesized distribution function. For a detailed description
of these goodness-of-fit tests, see Law and Kelton [8].
    Both goodness-of-fit tests have their own advantages and drawbacks. Perhaps the
main advantage of the chi-square test is that when unknown parameters must be estimated
from the data, a generic correction can be applied to the distribution of the test statistic by
reducing the number of degrees of freedom. If parameters must be estimated from the data
to apply the K-S test, no general adjustment is available for the critical value of the
test statistic. However, the K-S test also has several advantages over the chi-square
test. First, the K-S test does not require grouping the data in any way, so no
information is lost; this also eliminates the troublesome problem of interval specification,
which arises with the chi-square test. Second, the K-S test is exact for any sample
size and any distribution model when the parameters are not estimated from the data,
whereas the chi-square test is valid only in an asymptotic sense. For several hypothesized
continuous distributions, if the parameters are specified without making use of the data,
the K-S test can be applied to accurately assess and rank the goodness-of-fit of those
distributions. In the following, the K-S test will be applied to compare the distributions
selected for capturing the stochastic variations of train event times and process times.
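
As an illustration of the "all-parameters-known" K-S test described above, the following minimal Python sketch computes the K-S statistic, i.e. the largest vertical distance between the empirical distribution function and a fully specified hypothesized one, together with a p-value. The paper does not state how its p-values were computed; the asymptotic Kolmogorov series used here is an assumption, and the function names `ks_statistic` and `ks_pvalue` are hypothetical.

```python
import math


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution.

    ASSUMPTION: the large-sample series is used; it is only valid when the
    hypothesized parameters were not estimated from the tested sample.
    """
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the p-value is indistinguishable from 1 in this range
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * series))


# Toy usage: three observations tested against a uniform(0, 1) hypothesis.
d = ks_statistic([0.1, 0.4, 0.7], lambda x: x)
print(d, ks_pvalue(d, 3))
```

With only three observations the asymptotic p-value is of course not reliable; the toy call only shows how the two functions fit together.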

4   Assessing the selected distributions based on a parameter fine-
    tuning approach
Having selected the statistical distributions for train event times and process times, we
need to further specify the parameters to test the goodness-of-fit. If the parameters of a
selected distribution were estimated directly from the data which are used to obtain the
empirical distribution in the goodness-of-fit test, the critical value of the test statistic
distribution, depending on the type of the hypothesized distribution, could not be obtained
accurately. To assess the selected statistical distributions, we will specify the parameters
without making use of the data directly, and apply the K-S test and rank the goodness-of-fit.
As mentioned before, if the distribution parameters are specified only roughly, the
hypothesized distribution is often rejected, especially in case of a large sample size of
the data. Therefore, we propose a new approach to fine-tuning the parameters of the
selected distributions. The assessment of the candidate distributions will be carried out
based on this parameter fine-tuning approach.
    To specify the parameters and test a selected statistical distribution for a given type of
train event time or process time, we first split the data into two subsets randomly, e.g. by
assigning the chronologically recorded times alternately. Using the first data subset, an
initial estimate of the distribution parameters is obtained. We then fine-tune the
parameters to improve the goodness-of-fit in the K-S test, where the empirical distribution
is, however, obtained from the second data subset. Each subset contains only half of the
original sample, but the “all-parameters-known” form of the K-S test can be applied. In
this case, the critical value of the test statistic distribution can be accurately estimated and
is applicable to all distribution types. In addition, randomly splitting the original data into
two subsets ensures that the distribution fitted well to the second data subset is the
distribution we need for capturing the variability of the original data.
     The parameters of a given distribution can be estimated on the basis of empirical data
using the method of moments, the maximum likelihood method, or Bayesian estimation. The
maximum likelihood method is widely used in practice because a Maximum Likelihood
Estimator (MLE) approaches the minimum-variance unbiased estimator as the sample size increases [2]. The
estimation of distribution parameters can be influenced by outliers. An outlier is a data
point which deviates from the bulk of the data [4]. In the case of train event times and
process times, the outliers are mostly some very large values, as has been confirmed by
our data analysis. The impact of an outlier depends on the sample size and the spread of
the data. However, there is no exact standard method to detect outliers. A new heuristic
procedure to deal with outliers is incorporated in the following approach to fine-tuning the
parameters of a candidate distribution of a data set:
      • An initial estimate of the distribution parameters is obtained using the MLE based
        on the first split data subset.
      • The largest delays in the data subset are omitted iteratively, one by one, and the
        distribution parameters are re-estimated each time using the MLE. In each iteration, we
        compute the p-value of the K-S test where the empirical distribution is obtained
        from the second split data subset. The iterative procedure terminates when the p-value
        can no longer be increased.
      • After the iterative procedure, we fine-tune the parameter estimate again to further
        improve the estimation by maximizing the p-value of the K-S test, where of course
        the empirical distribution remains unchanged. In detail, a neighbourhood ±1.0 and
        a fine-tuning step 0.1 are considered for the parameters of the lognormal
        distribution as well as the shape parameters of the gamma, Weibull and beta
        distributions. In case of the other distribution parameters, a neighbourhood ±10 and
        a fine-tuning step 1.0 are considered. Herein, the time unit is in seconds.
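
The three steps above can be sketched in Python for the location-shifted lognormal distribution. This is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, the p-value uses the asymptotic Kolmogorov series, the shift is placed 1 s below the smallest observation so that all logarithms are defined, and the ±1.0 neighbourhood with step 0.1 is searched as a joint grid over both lognormal parameters (the paper does not say whether the search is joint or per parameter). The input data are synthetic.

```python
import math
import random


def lognormal_cdf(x, mu, sigma, shift):
    """CDF of a location-shifted lognormal distribution."""
    if x <= shift:
        return 0.0
    z = (math.log(x - shift) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_pvalue(sample, cdf):
    """Asymptotic K-S p-value for a fully specified hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * series))


def lognormal_mle(data, shift):
    """Closed-form MLE of (mu, sigma) for a fixed shift."""
    logs = [math.log(x - shift) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def fine_tune_lognormal(times):
    """Split, fit on subset 1, tune against the K-S p-value on subset 2."""
    first, second = times[0::2], times[1::2]  # alternate assignment
    shift = min(times) - 1.0  # ASSUMPTION: 1 s below the minimum

    def score(mu, sigma):
        return ks_pvalue(second, lambda x: lognormal_cdf(x, mu, sigma, shift))

    # Step 1: initial MLE on the first subset.
    kept = sorted(first)
    mu, sigma = lognormal_mle(kept, shift)
    p_initial = best = score(mu, sigma)

    # Step 2: drop the largest value one at a time while the p-value rises.
    while len(kept) > 2:
        m, s = lognormal_mle(kept[:-1], shift)
        if s <= 0.0:
            break
        p = score(m, s)
        if p <= best:
            break
        kept, mu, sigma, best = kept[:-1], m, s, p

    # Step 3: grid search in a +/-1.0 neighbourhood with step 0.1.
    mu0, sigma0 = mu, sigma
    for i in range(-10, 11):
        for j in range(-10, 11):
            m, s = mu0 + 0.1 * i, sigma0 + 0.1 * j
            if s <= 0.0:
                continue
            p = score(m, s)
            if p > best:
                mu, sigma, best = m, s, p
    return {"mu": mu, "sigma": sigma, "shift": shift,
            "p_initial": p_initial, "p_final": best}


# Synthetic, made-up arrival times in seconds (not data from the paper).
random.seed(1)
synthetic = [30.0 + random.lognormvariate(4.0, 0.5) for _ in range(400)]
print(fine_tune_lognormal(synthetic))
```

Because every step only accepts a candidate when the p-value increases, the final p-value can never be lower than the one obtained from the initial MLE.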
The parameters of a statistical distribution can be classified, on the basis of their physical
or geometric interpretation, as one of three basic types: location, scale and shape
parameters. If the location parameter of a statistical distribution is the lower endpoint of
the distribution’s range and is nonzero, this location parameter is also called a shift
parameter and the distribution is called a location-shifted distribution. Train event times
and process times generally have a lower bound. Therefore, a location-shifted distribution
is applied in most cases. Note that we take the shift parameter of a
statistical distribution to be the minimal value of the data.
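
As a worked instance of a location-shifted distribution, the distribution function of a shifted Weibull model can be written as follows. The symbols are assumptions introduced here for illustration: a is the shape parameter, b the scale parameter, c the shift parameter, and x_1, ..., x_n the observed times in seconds.

```latex
F(x) =
\begin{cases}
0, & x < c, \\[4pt]
1 - \exp\!\left[-\left(\dfrac{x - c}{b}\right)^{a}\right], & x \ge c,
\end{cases}
\qquad c = \min_{1 \le i \le n} x_i .
```

Setting c = 0 recovers the ordinary two-parameter Weibull distribution; the same substitution of x - c for x yields the shifted lognormal and gamma distributions.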
     To assess those candidate distributions, the K-S tests are performed at a commonly
adopted significance level of 0.05. Each type of the distributions is ranked according to
the p-value of the K-S test where the hypothesized distribution is specified with the fine-
tuned parameters. In detail, we give rank 1 to the candidate distribution with the biggest p-
value, rank 2 to the distribution with the second biggest p-value, and so on. The candidate
distribution ranked 1 is considered to be the best one among the selected distributions.
The quality of the distribution fitting for train event times and process times is also
visualized by comparing the fitted distribution density curve with the kernel density
estimate [2] and the histogram and by applying the distribution differences plot for the
fitted distribution and the empirical one.
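
The ranking rule described above can be sketched in a few lines of Python. The function name `rank_by_pvalue` and the p-values in the usage example are made up for illustration and are not results from the paper; a fit is treated as accepted when its p-value exceeds the significance level.

```python
def rank_by_pvalue(pvalues, alpha=0.05):
    """Rank candidate distributions by descending K-S p-value.

    Returns (rank, name, p_value, accepted) tuples, where `accepted` means
    the fit is not rejected at significance level `alpha`.
    """
    ordered = sorted(pvalues.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, p, p > alpha)
            for rank, (name, p) in enumerate(ordered, start=1)]


# Hypothetical p-values for one studied case (illustrative only).
example = {"lognormal": 0.62, "gamma": 0.31, "Weibull": 0.08, "normal": 0.01}
for row in rank_by_pvalue(example):
    print(row)
```

Ties are left in the order produced by the sort; the paper does not describe how equal p-values are handled.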

5   Distribution assessment results
5.1 Train event times

The statistical fitting of train arrival time distribution is a prerequisite for predicting the
propagation of train delays in stations. To incorporate the impact of knock-on delays
suffered before occupying the platform track in a delay propagation model, we need to
distinguish the arrival times of trains at the platform track from those at the station
approach signal. We have compared the K-S goodness-of-fit among the candidate
distributions for the arrival times at those locations. It has been found that the lognormal
distribution gives the best fit in 9 and 11 out of 14 studied cases (a case represents a train
series in one direction), respectively. Additionally, the lognormal fits have been accepted
by the K-S test in most cases. It should be mentioned that the lognormal fits have been
specified with a shift parameter corresponding to the earliest arrival time.
    Figure 1 shows the lognormal density fit, the second best, i.e. gamma density fit,
kernel density estimate and empirical histogram for the arrival delays of an intercity train
series in the northbound direction at the platform track. The distribution differences plots
are given in Figure 2. Both density fits match the kernel density estimate and
the histogram rather well. The lognormal fit appears preferable to the
gamma fit since the former has a greater probability around the distribution mode than the
latter. The differences between the lognormal distribution and the empirical one are
also smaller overall than the differences between the gamma distribution and the empirical
one.

Figure 1: Lognormal and gamma density fits, kernel density estimate and histogram for the
arrival delays of an intercity train series IC2100N at the platform track

Figure 2: Distribution difference plot for the lognormal fit and the arrival delays of
IC2100N at the platform track and the plot for the gamma fit and the delays

Early arriving trains generally have a longer dwell time than late arriving trains. To
estimate the distribution of departure times more realistically by distinguishing the dwell
times of late arriving trains from those of early arriving trains, the distribution of non-
negative arrival delays is needed [12]. We have compared the K-S goodness-of-fit among
the candidate distributions for non-negative arrival delays at the platform track. The
analysis results reveal that the Weibull distribution gives the best fit to the data in 12 out
of 18 studied cases and it ranks second or third in the other 6 cases. In addition, the
Weibull fits have been accepted by the K-S test in all the cases.
    Figure 3 shows the Weibull density fit, the second best, i.e. gamma density fit as well
as the empirical histogram for the non-negative arrival delays of an interregional train
series in the northbound direction. It appears that the Weibull density fit matches with the
histogram better than the gamma fit. Figure 4 displays the distribution differences plots.
The differences between the Weibull distribution and the empirical distribution are overall
smaller than the differences between the gamma distribution and the empirical one.

Figure 3: Fitted Weibull and gamma density curves and histogram of the non-negative arrival
delays of an interregional train series IR2200N (x-axis: non-negative arrival delay IR2200N [s];
y-axis: probability density)

Figure 4: Distribution differences plot for the Weibull fit and the non-negative arrival
delays of IR2200N and the plot for the gamma fit and the delays (x-axis: non-negative arrival
delay IR2200N [s]; y-axis: fitted distribution minus empirical distribution)

The distribution of departure times at stations can be used to predict the distribution of
outbound track release times and the distribution of arrival times at the following stations.
For the departure delays of trains, the Weibull distribution fits best to the data in 11 out of
18 studied cases. Furthermore, the Weibull fits have been accepted by the K-S test in most
cases. Figure 5 and Figure 6 visualize the goodness-of-fit of the Weibull and gamma
distributions for the departure delays of an intercity train series in the southbound
direction. It appears that the Weibull density fit matches with the data better than the
gamma fit and the maximum difference between the Weibull distribution and the
empirical distribution is smaller than the maximum difference between the gamma
distribution and the empirical one.
    For both non-negative arrival delays and departure delays, the Weibull distribution fits
the data generally better than the gamma distribution. This may result from a more
flexible property of the Weibull model than the gamma model for capturing the
distribution shape of a data set. The kernel density curves estimated for non-negative
arrival delays and departure delays are decreasing except in a few cases. As a result, the
shape parameter of a Weibull distribution fit is generally smaller than 1.0. If the fine-tuned
shape parameter is around 1.0, the exponential distribution, as a special case of the
Weibull distribution, can be used to capture the stochastic variations of both non-negative
arrival delays and departure delays.
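
The link between the Weibull shape parameter, a decreasing density, and the exponential special case can be made explicit with the standard Weibull density. The symbols a (shape) and b (scale) are notation assumed here, with the delay x measured from the lower bound.

```latex
f(x) = \frac{a}{b}\left(\frac{x}{b}\right)^{a-1}
       \exp\!\left[-\left(\frac{x}{b}\right)^{a}\right], \quad x > 0,
\qquad
\frac{d}{dx}\ln f(x) = \frac{a-1}{x} - \frac{a}{b}\left(\frac{x}{b}\right)^{a-1}.

% a <= 1: both terms are non-positive, so f is strictly decreasing on x > 0.
% a = 1:  f(x) = (1/b) \exp(-x/b), the exponential density with mean b.
% a > 1:  f is unimodal with mode x^{*} = b \left(\frac{a-1}{a}\right)^{1/a}.
```

A monotonically decreasing density therefore corresponds to a shape parameter of at most 1.0, and a fitted shape close to 1.0 means the fit is close to an exponential distribution.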

Figure 5: Fitted Weibull and gamma density curves and the empirical histogram of the
departure delays of an intercity train series IC2400S (x-axis: departure delay IC2400S [s];
y-axis: probability density)

Figure 6: Distribution differences plot for the Weibull fit and the departure delays of
IC2400S and the plot for the gamma fit and the delays (x-axis: departure delay IC2400S [s];
y-axis: fitted distribution minus empirical distribution)

5.2 Train process times

Train process times include the dwell times, running times, and track occupancy times of
trains. The dwell times of trains at a station are the difference between the arrival and
departure times. To estimate knock-on delays and departure delays of trains at a station,
we need to know the necessary dwell times of trains for passenger alighting and boarding
in the absence of hindrance from other trains. These necessary dwell times of trains are
defined as the free dwell times in Yuan [12]. To accurately incorporate the impact of the
knock-on delays of trains suffered before occupying a platform track or a junction in the
modelling of delay propagation in a railway network, the conditional distributions of train
running times and track occupancy times in case of (no) hindrance are needed.
    To obtain the distribution of the free dwell times of trains, we need the corresponding
data. However, based only on the track occupancy and release records, i.e. TNV data, it is
hardly possible to measure the free dwell time of an individual train that is hindered at
the platform track because the next signal block is occupied. Therefore, the free dwell
time distributions are fitted separately for early and late arriving trains, using
the observed dwell times of the subset of those trains which are not hindered at the station.
In case of early arriving trains, the subset contains the trains whose outbound route has
been set to ‘Free’ and the departure signal to ‘Go’ before the train is ready for departure,
which is assumed to be at the scheduled departure time. In case of late trains, the subset
includes the trains whose outbound route has been set to ‘Free’ and the departure signal to
‘Go’ before the train is ready for departure, which is assumed to be after a minimal dwell
time of 30s. In fact, some trains which do not belong to these subsets are also not hindered
by other trains, namely when they have a longer necessary dwell time.
    Note that the free dwell time of an early arriving train is influenced by the arrival
earliness significantly, and the knock-on delay and the departure delay may be estimated
based on the scheduled arrival time and the free dwell time excluding the arrival earliness
[12]. Therefore, we take only this part of the free dwell time as the free dwell time for an
early train in this analysis.

The statistical analysis results reveal that for the free dwell times of early arriving
trains, the Weibull distribution gives the best fit in 8 out of 15 studied cases. For the free
dwell times of late arriving trains, the Weibull distribution fits best to the data in 7 out of
18 studied cases. Additionally, the Weibull distributions fitted to the free dwell times of
both early and late arriving trains have been accepted by the K-S test except for one
studied case. Figure 7 shows the Weibull density fit, the second-best fit (the normal
density), the kernel density estimate and the empirical histogram for the free dwell times
of the late arriving trains of an interregional train series in the northbound direction. The
fitted Weibull distribution with a shape parameter of 1.8 matches the kernel estimate for
the empirical data much better than the normal fit with respect to both the overall shape
and the tails of the distribution. The better fit of the Weibull model is also illustrated by
the distribution differences plots given in Figure 8.

Figure 7: Weibull and normal fits, kernel estimate and histogram for the free dwell
times of late arriving trains of an interregional train series IR2200N

Figure 8: Distribution differences plots for the Weibull and normal fits and the free
dwell times of late arriving trains of IR2200N
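
The comparison of a Weibull fit and a normal fit by the K-S test can be sketched as follows. This is only an illustration under stated assumptions: the dwell time values are made up, the distribution parameters are hypothetical rather than estimated, and the helper names are not taken from the paper or from S-Plus. The K-S statistic is the largest vertical distance between the empirical distribution function and the fitted distribution function.

```python
import math


def weibull_cdf(x, shape, scale, shift=0.0):
    """CDF of a location-shifted Weibull distribution."""
    if x <= shift:
        return 0.0
    return 1.0 - math.exp(-(((x - shift) / scale) ** shape))


def normal_cdf(x, mean, std):
    """CDF of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Made-up free dwell times in seconds (not the paper's data).
dwell = [42, 48, 51, 55, 58, 60, 63, 67, 72, 78, 85, 96, 110, 131]

# Hypothetical parameter values; in practice they are estimated from the data.
d_weibull = ks_statistic(
    dwell, lambda x: weibull_cdf(x, shape=1.8, scale=40.0, shift=30.0)
)
d_normal = ks_statistic(dwell, lambda x: normal_cdf(x, mean=72.6, std=25.0))
```

The candidate with the smaller statistic lies closer to the empirical distribution. Note that the standard K-S critical values assume a fully specified distribution, so they are conservative when the parameters are estimated from the same sample.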

The density estimate and empirical histogram of the free dwell times of trains are
generally skewed to the right. The free dwell times of early arriving trains cannot be
shorter than the scheduled dwell time. The free dwell times of late arriving trains may be
shorter than the scheduled time, but they are longer than the minimum time for opening
the doors of a train, announcing the departure, and closing the doors. Taking into account
the skewness to the right and the lower bound, a location-shifted Weibull distribution
better captures the variability of the free dwell times of trains than a normal distribution.
     In case of an early arrival, the train driver may not be in a hurry and hence may wait
for some late passengers to board the train. In case of a late arrival, the train may stop at
the platform longer than the minimum dwell time due to an increased number of boarding
passengers. Obviously, the probability of large free dwell times is very small. Therefore,
the probability density curve of the free dwell times of trains is generally unimodal rather
than monotonically decreasing, and the shape parameter of the Weibull distribution fit is
mostly bigger than 1.0. Only if the scheduled dwell time is much longer than the time
necessary for passenger alighting and boarding can the shape parameter of the Weibull
distribution fitted to the free dwell times of early arriving trains (excluding the arrival
earliness) be smaller than 1.0.
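
The role of the shape parameter can be made explicit with the density of a location-shifted Weibull distribution. The symbols below are assumptions introduced for illustration: $k$ denotes the shape parameter, $\lambda$ the scale parameter and $c$ the location shift (the lower bound of the dwell time).

```latex
f(x) = \frac{k}{\lambda}\left(\frac{x-c}{\lambda}\right)^{k-1}
       \exp\!\left[-\left(\frac{x-c}{\lambda}\right)^{k}\right],
       \qquad x > c,

x^{*} = c + \lambda\left(\frac{k-1}{k}\right)^{1/k}
       \qquad \text{(mode, for } k > 1\text{)}.
```

For $k > 1$ the density rises from zero at $x = c$ to a single peak at $x^{*}$ and then decays, which gives the right-skewed, unimodal shape. For $k \le 1$ the density decreases monotonically from $x = c$. With the shape parameter of 1.8 reported for Figure 7, the mode lies at approximately $c + 0.64\lambda$.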

In the following, we deal with the conditional distributions of the running times of
trains on the preceding block of The Hague HS station and those of the occupancy times
of the adjacent junctions around this station. Considering an approaching train, if the
inbound route is released before the train arrives within sight distance of the approach
signal of the station home signal, the train proceeds freely to the platform. Otherwise, the
train is hindered and has to decelerate or even stop before the home signal. The
conditional distributions of train running and track occupancy times with and without
hindrance differ from each other significantly.
    To obtain those conditional distributions, the first step is to classify the data. By
comparing the arrival time of each train at the station approach signal to the clearance
time of the inbound route, we have extracted a data set suited for fitting the distribution of
free running times in each studied case. A hindered approaching train will pass the
station home signal at a reduced speed if it may proceed to this signal. Otherwise, the
hindered train will pass the home signal while accelerating after a stop before this signal. Since
the standstill of a train on a track is not recorded, we cannot directly identify whether or
not a hindered train stops before the home signal based on track occupancy and release
records.
    However, we have observed that the difference between the arrival time of a hindered
train at the station approach signal and the clearance time of the inbound route is generally
longer in case the train stops before the home signal. On the other hand, the difference
between the route clearance time and the passing time of the train at the station home
signal is generally shorter in case the train stops before the home signal than if it may
proceed. Adopting the k-means data clustering routine included in the statistical analysis
tool S-Plus [9], we split the data sample of hindered trains for each studied train series
into two separate parts which correspond approximately to the aforementioned two cases.
The same k-means data clustering method was applied to extract the data sets suited for
fitting the conditional distributions of inbound junction occupancy times by each relevant
passing train series.
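
The split of hindered trains into two groups can be sketched with a plain k-means (Lloyd's algorithm) implementation. This is a generic illustration, not the S-Plus routine used in the analysis. The two features per train, the sample values and the choice of initial centres are assumptions: the first feature is the time from arrival at the approach signal to route clearance, the second the time from route clearance to passing the home signal, both in seconds.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centre by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


# Made-up hindered trains: (wait for route clearance, clearance to home signal).
points = [(12, 48), (15, 52), (9, 45), (18, 50),
          (70, 22), (85, 18), (64, 25), (92, 20)]

# Hypothetical initialisation: first and last observation as starting centres.
labels, centers = kmeans(points, centers=[points[0], points[-1]])
```

With these values, the cluster with the long wait and the short clearance-to-passing time would correspond to trains that stopped before the home signal. If the two features have very different ranges in real data, scaling them before clustering may be advisable.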
    In case of a departing train, if it is hindered due to outbound route conflicts, it dwells
at the station for a longer time. However, it will not be hindered on the next track sections.
Thus, the conditional distributions are not applicable to the running times of trains on the
outbound track sections and outbound track occupancy times.
    For the distributions and conditional distributions of train running and track occupancy
times, either the Weibull or normal model may be the best fit to the data. However, a
generic distribution model suitable for capturing the variability has not been found. This
might be caused by the large variation of train speeds on the short track sections in the
complicated station and interlocking area. The data classification and the further data split,
which reduce the size of the data sample, also hamper the determination of a generic
distribution model for the conditional train running and track occupancy times.

6   Conclusions
We have compared the goodness of fit of several distribution models selected for train
event times and process times, using the K-S test and fine-tuned distribution parameters.
The track occupancy and release times recorded at the Dutch railway station The Hague HS
were used. A location-shifted lognormal distribution was found to be the best model
among the candidate distributions for the arrival times of trains both at the platform track
and at the station approach signal. The Weibull distribution
can generally be considered as the best distribution model for non-negative arrival delays,
departure delays and the free dwell times of trains. The shape parameter of a Weibull
distribution fitted to either non-negative arrival delays or departure delays is mostly
smaller than 1.0. However, the shape parameter of a Weibull distribution fitted to the free
dwell times of trains is generally bigger than 1.0.
    These distribution fitting results can be used as input to analytical or simulation
models for predicting the propagation of train delays and the punctuality of trains at
certain stations. Based on this prediction, we are able to maximize railway capacity
utilization while assuring a desired reliability and punctuality level of train operations.
Although the distribution assessment has been done using the data recorded at a case
station, the fitted lognormal and Weibull distributions can be applied to other existing
and new stations thanks to the flexibility of these two distribution models. However, it
should be kept in mind that the distribution parameters of train event times and process
times may vary in time and space. When timetables for a new infrastructure project are
designed and evaluated and real data of train event times and process times are not
available, the distribution parameters must be estimated as accurately as possible by
drawing on expert advice. Moreover, a study of the sensitivity of the modelling results to
the estimated parameters of the input distributions may be required.
    The distributions and conditional distributions of train running and track occupancy
times have also been studied. However, a generic distribution model suitable for capturing
their variability has not been found. Further research will be carried out to capture this
variability by collecting a larger data sample. The variability of train running times
between stations will also be studied.

References
[1] Carey, M., Carville, S., “Testing schedule performance and reliability for train
     stations”, Journal of the Operational Research Society 51, pp. 666-682, 2000.
[2] Dekking, F.M., Kraaikamp, C., Lopuhaä, H.P., Meester, L.E., A Modern
     Introduction to Probability and Statistics, Springer, London, 2005.
[3] Goverde, R.M.P., Hansen, I.A., “TNV-Prepare: Analysis of Dutch Railway
     Operations Based on Train Detection Data”, In: Allan, J., Hill, R.J., Brebbia, C.A.,
     Sciutto, G., Sone, S. (eds.), Computers in Railways VII, pp. 779-788, WIT Press,
     Southampton, 2000.
[4] Goverde, R.M.P., Punctuality of Railway Operations and Timetable Stability
     Analysis, Ph.D. thesis, Delft University of Technology, 2005.
[5] Güttler, S., Statistical modelling of Railway Data, M.Sc. thesis, Georg-August-
     Universität zu Göttingen, 2006.
[6] Hansen, I.A., “Improving Railway Punctuality by Automatic Piloting”, In: 2001
     IEEE Intelligent Transportation Systems Conference Proceedings, Oakland (CA),
     USA, pp. 792-797, 2001.
[7] Hermann, U., Untersuchung zur Verspätungsentwicklung von Fernreisezügen auf
     der Datengrundlage der Rechnerunterstützten Zugüberwachung Frankfurt am Main
     (Investigation of the Development of Delays of Long-distance Passenger Trains
     Based on Data from the Computer-aided Train Monitoring Frankfurt am Main),
     Ph.D. thesis, Technischen Hochschule Darmstadt, 1996.
[8] Law, A. M., Kelton, W.D., Simulation Modeling and Analysis, McGraw-Hill Higher
     Education, 2000.
[9] MathSoft, S-Plus 2000, User’s Guide, Seattle, USA, 1999.
[10] Robinson, S., Simulation: The Practice of Model Development and Use, John Wiley
     & Sons, Ltd, Chichester, 2004.
[11] Wendler, E., Naehrig, M., “Statistische Auswertung von Verspätungsdaten
     (Statistical Analysis of Delay Data)”, Eisenbahningenieurkalender EIK 2004, pp.
     321-331, 2004.
[12] Yuan, J., Stochastic Modelling of Train Delays and Delay Propagation in Stations,
     Ph.D. thesis, Delft University of Technology, 2006.
[13] Yuan, J., Goverde, R.M.P. & Hansen, I.A., “Propagation of Train Delays in
     Stations”, In: Allan, J., Hill, R.J., Brebbia, C.A., Sciutto, G., Sone, S. (eds.),
     Computers in Railways VIII, pp. 975-984, WIT Press, Southampton, 2002.
[14] Yuan, J., Hansen, I.A., “Optimizing Capacity Utilization of Stations by Estimating
     Knock-on Delays”, Transportation Research Part B 41(2), pp. 202-217, 2007.
