CE 256: STOCHASTIC HYDROLOGY
SHORT PROJECT
DATE OF SUBMISSION: 04.12.2012
BIHU SUCHETANA
2ND YEAR M.E., WATER RESOURCES AND ENVIRONMENTAL ENGINEERING
SR No.- 08308
TABLE OF CONTENTS
I.    INTRODUCTION
II.   TIME SERIES PLOT
III.  MOMENTS OF THE DATA
IV.   DISTRIBUTION OF THE DATA
V.    AUTO-COVARIANCE AND AUTO-CORRELATION
VI.   PARTIAL AUTOCORRELATION
VII.  LINE SPECTRUM
VIII. POWER SPECTRUM
IX.   DIFFERENCED DATA
X.    GENERATION OF MONTHLY DATA
XI.   STANDARDIZED SERIES
XII.  ARMA MODELS
XIII. SIMPLE LINEAR REGRESSION
INTRODUCTION:
Any engineering design with hydrological inputs involves uncertainty. The analysis of these inherent uncertainties is the main scope of "Stochastic Hydrology". Hydrologists have always strived for better predictive capability of stochastic processes. The key to more accurate prediction lies in historical data: history provides a valuable clue to the future. Unless drastic changes occur in the catchment (such as rapid urbanization or large-scale deforestation), statistical relations tend to persist. Thus, utilizing the available data for a catchment, we generate new data that is not exactly the same as the historical data but has the same statistical properties. It is this quality of "persistence" that helps us in better prediction and forecasting.
Various "tools" can be used to analyze historical data and capture the information it conveys. In our course we have studied some of these tools, which I have attempted to use in this project. The region chosen for analysis is Bardhaman district in West Bengal, India. The monthly rainfall values were taken for a period of 102 years (1901-2002) from the site www.indiawaterportal.org/met_data/. The rainfall data is enclosed in the Excel file named "RainfallWB". There are thus a total of 1224 data points.
TIME SERIES PLOT
A time series plots the values taken by a random variable over time. The random variable in this case is precipitation. Being located in the tropics, at Bardhaman precipitation implies rainfall. The time series plot for rainfall is as below:
Figure 1: Time series plot for the monthly rainfall at Bardhaman from 1901-2002
MOMENTS OF THE DATA:
The moments of the data provide valuable clues about its statistical properties. The moments obtained from the sample provide estimates of the population characteristics. In hydrology, the first four moments have special significance. The first moment, the mean, is a measure of central tendency. The second moment, the standard deviation, is a measure of dispersion. The next two moments, skewness and kurtosis, are measures of symmetry and peakedness respectively. The first four moments calculated from the sample data are as below:
MOMENT   NAME                  VALUE
1st      Mean                  118.0904 mm
2nd      Standard deviation    132.1581 mm  (coefficient of variation: 1.119)
3rd      Skewness              1.1389 (skewed to right)
4th      Kurtosis              3.4227 (leptokurtic)
Table 1: First 4 moments for the original series
The relations used are as below:

Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Variance: $S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

Standard deviation: $S = \sqrt{S^2}$

Coefficient of variation: $C_v = \frac{S}{\bar{x}}$

Coefficient of skew: $C_s = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^3}{(n-1)(n-2)S^3}$

Kurtosis coefficient: $K = \frac{n^2\sum_{i=1}^{n}(x_i - \bar{x})^4}{(n-1)(n-2)(n-3)S^4}$
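The moment formulas above can be sketched directly in code. The following is a minimal Python version (NumPy assumed available; not part of the original report):

```python
import numpy as np

def sample_moments(x):
    """First four moments using the report's small-sample formulas."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.sum() / n
    s2 = ((x - mean) ** 2).sum() / (n - 1)        # unbiased variance
    s = s2 ** 0.5                                  # standard deviation
    cv = s / mean                                  # coefficient of variation
    cs = n * ((x - mean) ** 3).sum() / ((n - 1) * (n - 2) * s ** 3)
    k = n ** 2 * ((x - mean) ** 4).sum() / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
    return mean, s, cv, cs, k
```

Applied to the 1224-point rainfall series, this reproduces the values in Table 1.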
DISTRIBUTION OF THE DATA
The pdf of the data is plotted and is seen to take the following shape. Intuitively, the shape of the pdf and the value of the kurtosis, which is close to 3, suggest a normal distribution; but negative values are absent here, as rainfall cannot be negative. So we check whether the data follows a log-normal distribution. By performing the Kolmogorov-Smirnov test (K-S test) using the available MATLAB function, it is verified that the data follows a log-normal distribution, with a mean of 4.77 and a standard deviation of 4.884 for the log-transformed values. Similarly, for the annual average rainfall values the pdf appears to follow a normal distribution, which is confirmed by the K-S test. The annual average rainfall values have a mean of 118.09 mm and a standard deviation of 18.094 mm.
Figure 2: PDF of the rainfall data
Figure 3: PDF of the annual average rainfall data, following normal distribution
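A K-S test of log-normality along these lines can be sketched in Python with SciPy (the report used MATLAB's built-in function; the series below is a synthetic stand-in, since the "RainfallWB" file is not reproduced here):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the monthly rainfall series (drawn log-normal,
# so the test below should not reject).
rng = np.random.default_rng(42)
rain = rng.lognormal(mean=4.77, sigma=1.0, size=1224)

# Take logs and run a one-sample K-S test against a normal distribution
# with the fitted parameters. (Estimating the parameters from the same
# sample biases the K-S p-value upward; a Lilliefors correction would be
# more rigorous.)
logs = np.log(rain)
mu, sigma = logs.mean(), logs.std(ddof=1)
d_stat, p_value = stats.kstest(logs, 'norm', args=(mu, sigma))
reject = p_value < 0.05   # False -> log-normality not rejected at 95%
```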
AUTO-COVARIANCE AND AUTO-CORRELATION:
Auto-covariance measures the covariance between elements of a series and other elements of the same series separated by a lag of k. The auto-covariance and auto-correlation are computed for the rainfall data; the resulting matrices are saved in the folder named "Autocov and Autocorrel". We then plot the auto-correlogram and partial auto-correlogram for the data, considering lags up to 0.25 times the number of data points, i.e., up to 306. The correlogram indicates the memory of the process, i.e., how far into the past the process can remember. The correlogram takes the following shape:
Figure 4: Auto-correlogram at lag 306 (significance bands at 95%)
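As a sketch (not the report's MATLAB code), the correlogram and its approximate 95% significance bands may be computed as:

```python
import numpy as np

def autocorrelogram(x, max_lag):
    """Sample autocovariance c_k and autocorrelation r_k up to max_lag,
    with the approximate 95% significance band +/- 1.96 / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xm = x - x.mean()
    c = np.array([(xm[:n - k] * xm[k:]).sum() / n for k in range(max_lag + 1)])
    r = c / c[0]                 # autocorrelation, r_0 = 1
    band = 1.96 / n ** 0.5       # 95% band for an uncorrelated series
    return c, r, band
```

For the 1224-point rainfall series, `max_lag=306` reproduces the lag range used in Figure 4.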
PARTIAL AUTOCORRELATION
Partial autocorrelation indicates the explanatory power of one variable in a regression when the dependence on all other intervening variables has been removed, or partialled out. The partial autocorrelations plotted against lag give the partial auto-correlogram, shown below:
Figure 5: Partial auto correlation at lag 306 (significance bands at 95%)
The slowly decaying, oscillatory nature of the correlogram indicates the possible presence of periodicity in the data. To capture this periodicity, we obtain the spectral densities in the frequency domain; the line spectrum and the power spectrum are used to identify the periodicities.
LINE SPECTRUM:
The line spectrum gives the amount of variance per unit frequency. The relations used for its calculation are:

$I_k = \frac{N}{2}\left(\alpha_k^2 + \beta_k^2\right)$

$\omega_k = \frac{2\pi k}{N}$

$\alpha_k = \frac{2}{N}\sum_{t=1}^{N} X_t \cos(2\pi f_k t)$

$\beta_k = \frac{2}{N}\sum_{t=1}^{N} X_t \sin(2\pi f_k t)$

where $f_k = k/N$ and $k = 1, 2, 3, \ldots, 0.25N$.
The line spectrum of the given data is as below:
Figure 6: Line Spectrum of the Original Series
From the line spectrum we notice two significant periodicities, at ω = 0.5236 and ω = 1.0472. The periodicity P is given by:
$P = 2\pi/\omega$
So the data has periodicities of 12 months (ω = 0.5236) and 6 months (ω = 1.0472).
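A minimal sketch of the line-spectrum computation, following the formulas above (Python, not the report's own code):

```python
import numpy as np

def line_spectrum(x):
    """Line spectrum I_k = (N/2)(alpha_k^2 + beta_k^2) for k = 1..0.25N."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1)
    ks = np.arange(1, int(0.25 * n) + 1)
    omega = 2 * np.pi * ks / n
    alpha = np.array([2 / n * (x * np.cos(2 * np.pi * k / n * t)).sum() for k in ks])
    beta = np.array([2 / n * (x * np.sin(2 * np.pi * k / n * t)).sum() for k in ks])
    power = n / 2 * (alpha ** 2 + beta ** 2)
    return omega, power
```

For a series with an annual cycle, the spectrum peaks at ω = 2π/12 ≈ 0.5236, as in Figure 6.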
Statistical test for significance of periodicities:
A statistic is defined as below (Kashyap and Rao, 1976):

$\cap = \frac{\gamma^2\,(N-2)}{4\rho}$

where $\gamma^2 = \alpha_k^2 + \beta_k^2$,

$\alpha_k = \frac{2}{N}\sum_{t=1}^{N} X_t \cos(2\pi f_k t)$

$\beta_k = \frac{2}{N}\sum_{t=1}^{N} X_t \sin(2\pi f_k t)$

$\rho = \frac{1}{N}\sum_{t=1}^{N}\left(X_t - \alpha_k \cos(\omega_k t) - \beta_k \sin(\omega_k t)\right)^2$

and N is the total number of data points (1224 in this case).
For testing the periodicity associated with a particular $\omega_k$, $\cap$ is compared with F(2, N−2); F(2, N−2) ≈ 3 for N > 120 at 95% confidence.
Table 2: Test for significance of periodicities. Both periodicities are found to be significant at 95% confidence.
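The significance test can be sketched as follows (a Python stand-in under the formulas above; the threshold 3 is the F(2, N−2) critical value at 95% for large N):

```python
import numpy as np

def kashyap_rao_stat(x, k):
    """Test statistic for the significance of harmonic k.
    Significant at 95% if the statistic exceeds F(2, N-2) ~ 3 for N > 120."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1)
    w = 2 * np.pi * k / n
    alpha = 2 / n * (x * np.cos(w * t)).sum()
    beta = 2 / n * (x * np.sin(w * t)).sum()
    gamma2 = alpha ** 2 + beta ** 2
    resid = x - alpha * np.cos(w * t) - beta * np.sin(w * t)
    rho = (resid ** 2).sum() / n
    return gamma2 * (n - 2) / (4 * rho)
```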
POWER SPECTRUM:
The line spectrum is a statistically inconsistent estimate. To get a statistically consistent estimate, we plot the power spectrum. A Tukey window is used as the lag window λj, with a maximum lag of 0.25 times the length of the data. The plot clearly exhibits a smoothened appearance compared to the line spectrum. The relations used here are:

$f_k = \frac{k}{N}$

$\alpha_k = \frac{2}{N}\sum_{t=1}^{N} x_t \cos(2\pi f_k t)$

$\beta_k = \frac{2}{N}\sum_{t=1}^{N} x_t \sin(2\pi f_k t)$

Frequency: $\omega_k = \frac{2\pi k}{N}$

Power spectral density: $I_k = 2\left[C_0 + 2\sum_{j=1}^{M} \lambda_j\, C_j \cos(2\pi f_k j)\right]$

where $C_j$ is the covariance at lag j, $C_0$ is the variance, and the Tukey window is

$\lambda_j = \frac{1}{2}\left(1 + \cos\frac{\pi j}{M}\right)$

with M the maximum lag (= 0.25N) and N the length of the data.
Figure 7: Power spectrum for the original series, exhibiting smoothened appearance
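The windowed estimate can be sketched in Python as follows (assuming, as above, the Tukey lag window and M = 0.25N; this is an illustration, not the report's code):

```python
import numpy as np

def power_spectrum(x):
    """Smoothed spectral estimate with a Tukey lag window, M = 0.25N."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = int(0.25 * n)                    # maximum lag M
    xm = x - x.mean()
    # autocovariances C_0 .. C_M
    c = np.array([(xm[:n - j] * xm[j:]).sum() / n for j in range(m + 1)])
    lam = 0.5 * (1 + np.cos(np.pi * np.arange(m + 1) / m))   # Tukey window
    j = np.arange(1, m + 1)
    ks = np.arange(1, m + 1)
    omega = 2 * np.pi * ks / n
    spec = np.array([2 * (c[0] + 2 * (lam[1:] * c[1:]
                     * np.cos(2 * np.pi * k / n * j)).sum()) for k in ks])
    return omega, spec
```

The smoothing trades frequency resolution for a statistically consistent, lower-variance estimate, which is why Figure 7 looks smoother than the line spectrum.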
DIFFERENCED DATA:
Differencing a series removes non-stationarity in the data. Using first-order differencing, we construct a new series where:

$Y_t = X_t - X_{t-1}$

where $Y_t$ is the t-th term of the differenced series and $X_t$, $X_{t-1}$ are the t-th and (t−1)-th terms of the original series. From the differenced series we obtain the auto-correlogram, partial auto-correlogram, line spectrum and power spectrum. It is noted that both the line and power spectra of the differenced series are similar to those of the original data.
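As a small illustration (with a hypothetical trended series, not the rainfall data), first-order differencing reduces a linear trend, one common source of non-stationarity, to a constant:

```python
import numpy as np

# Hypothetical series with a linear trend.
t = np.arange(100)
x = 5.0 + 2.0 * t            # original series X_t
y = np.diff(x)               # differenced series Y_t = X_t - X_{t-1}
# y is constant (2.0): the trend has been removed.
```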
Figure 8: Auto correlogram of differenced series at lag 305
Figure 9: Partial-Auto correlogram of differenced series at lag 305
Figure 10: Line spectrum for the differenced series
Figure 11: Power spectrum for the differenced series
GENERATION OF MONTHLY DATA:
Generation of monthly data for 50 years is done using the non-stationary first-order Markov model, also called the non-stationary Thomas-Fiering model. The basic equation used is:

$X_{i,j+1} = \mu_{j+1} + \rho_j\,\frac{\sigma_{j+1}}{\sigma_j}\left(X_{i,j} - \mu_j\right) + t_{i,j+1}\,\sigma_{j+1}\sqrt{1 - \rho_j^2}$

where
i denotes the year (1 to 50);
j denotes the month (1 to 12);
$X_{i,j}$ = rainfall value in the j-th month of the i-th year,
$\mu_j$ = mean rainfall in the j-th month,
$\sigma_j$ = standard deviation of rainfall for the j-th month,
$\rho_j$ = lag-1 correlation between the j-th month and the (j+1)-th month,
$t_{i,j+1}$ = standard normal deviate.
Table 3: Correlation of data of a particular month with the next month
The generated values of rainfall for the next 50 years are enclosed in Sheet 1 of the MS Excel file called
“Generated Values”.
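A generator along these lines can be sketched as follows (a Python illustration, assuming the month indices wrap cyclically and generation starts in January; the report's actual values were produced from the monthly statistics of the rainfall series):

```python
import numpy as np

def thomas_fiering(mu, sigma, rho, n_years, x_start, seed=1):
    """Non-stationary first-order Markov (Thomas-Fiering) monthly generator.
    mu, sigma : length-12 monthly means and standard deviations
    rho       : length-12 lag-1 correlations, rho[j] linking month j to j+1
    x_start   : value for the month preceding the first generated month"""
    rng = np.random.default_rng(seed)
    out = np.empty(12 * n_years)
    prev, j = x_start, 11          # previous month: December before year 1
    for i in range(out.size):
        nxt = i % 12
        t = rng.standard_normal()  # standard normal deviate t_{i,j+1}
        out[i] = (mu[nxt]
                  + rho[j] * sigma[nxt] / sigma[j] * (prev - mu[j])
                  + t * sigma[nxt] * np.sqrt(1.0 - rho[j] ** 2))
        prev, j = out[i], nxt
    return out
```

Note that negative generated values are kept inside the recursion (as the report requires) and would be clipped to zero only when the series is put to use, e.g. `np.clip(out, 0, None)`.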
MOMENT   NAME                 VALUE (generated data)      VALUE (original data)
1st      Mean                 114.1 mm                    118.0904 mm
2nd      Standard deviation   127.1 mm                    132.1581 mm
3rd      Skewness             1.0022 (skewed to right)    1.1389 (skewed to right)
4th      Kurtosis             2.8961 (platykurtic)        3.4227 (leptokurtic)
Table 4: Comparison of the first 4 moments of generated and original data
The values of the moments calculated from the generated data and from the original data are approximately equal, showing that the statistical properties of the original data have been retained during generation.
Some of the generated values are negative, which is physically infeasible. Hence, while using these generated values for reservoir operation we must ensure that the negative values are eliminated; they must, however, be preserved for generating the value at the next time step. For operation and decision-making purposes, the generated values we use are shown in Sheet 2 of the MS Excel file "Generated Values".
The time series plot for the generated values is as below:
Figure 12: Time series plot for the 50 years’ generated values
Figure 13: Auto-correlogram for the 50 years’ generated values
Figure 14: Partial-Auto-correlogram for the 50 years’ generated values
Figure 15: Line spectrum for the generated data, showing periodicity corresponding approximately to ω = 0.5236, similar to the original data
Figure 16: Power spectrum for the generated data, showing periodicity nearly corresponding to
original data
STANDARDIZED SERIES:
Standardization of the original time series may be done in two ways:
(a) using the long-term mean and standard deviation, or
(b) using the monthly mean and standard deviation.
Standardization means subtracting the mean from the original data and dividing by the standard deviation:

$X_{st} = \frac{X_i - \mu}{\sigma}$

where $X_{st}$ is the standardized value, $X_i$ the original value, $\mu$ the mean of the original series, and $\sigma$ the unbiased standard deviation of the original series.
The advantage of standardizing is that the periodicities are removed in the resultant series. In this case, standardization by the second technique, i.e., using the monthly mean and standard deviation, yields a series devoid of periodicity.
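The second (monthly) technique can be sketched as follows (a Python illustration, assuming the series starts in January so that every 12th value belongs to the same calendar month):

```python
import numpy as np

def standardize_monthly(x):
    """Standardize each calendar month by its own mean and unbiased
    standard deviation (technique (b)), removing the annual periodicity."""
    x = np.asarray(x, dtype=float)
    z = np.empty_like(x)
    for m in range(12):
        month = x[m::12]
        z[m::12] = (month - month.mean()) / month.std(ddof=1)
    return z
```

By construction, every calendar month of the resulting series has zero mean and unit standard deviation, which is why the seasonal peaks disappear from the spectra in Figures 20 and 21.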
Figure 17: Comparative time series plot of original and standardized data
Figure 18: Auto-correlogram for the standardized data
Figure 19: Partial-Autocorrelogram for the standardized data
Figure 20: Line spectra for the original and standardized series. The periodicities are seen to have
been removed
Figure 21: Power spectra for the original and standardized series. The periodicities are seen to have
been removed
ARMA MODELS (Auto-Regressive Moving Average Models):
ARMA models are used both for one-time-step-ahead forecasting, i.e., operation problems, and for long-term generation, i.e., planning problems. The steps to be followed, in order, are model identification, parameter estimation and validation. Here, model selection is based on the maximum likelihood rule (Kashyap and Rao, 1976):

$L_i = -\frac{N}{2}\ln\sigma_i - \eta_i$

where $L_i$ is the likelihood of the i-th model, N the total number of data points, $\sigma_i$ the standard deviation of the residual series of the i-th model, and $\eta_i$ the total number of parameters, i.e., the sum of the numbers of AR and MA parameters.
The maximum likelihood criterion is in agreement with the principle of parsimony of Box and Jenkins (1970). The auto-correlogram shows a sinusoidal decay and the power spectrum shows the dominance of the middle frequencies; this suggests as an initial guess that AR models may be suitable for data generation. In hydrology we typically go up to AR(6) models for data generation and forecasting. An AR model may be represented by the following general equation, where $\epsilon_t$ represents the residual series and p the order of the AR model:
$X_t = \sum_{i=1}^{p} \Phi_i X_{t-i} + \epsilon_t$
Using the “armax” function in MATLAB, we obtain the parameters of the AR model, which are
tabulated as below:
Table 5: Values of the AR parameter 𝚽
Of the 6 candidate models, the ARMA(1,0) model yields the best results. The generated values have a mean of 92.4534 mm and a standard deviation of 103.3727 mm, close to those of the original series; the skewness coefficient is 1.0827 and the kurtosis coefficient is 3.092. So, of the 6 candidate models, the AR(1) model can be used for the most accurate data generation. The residual series should ideally exhibit the properties of white noise, i.e., zero mean, no correlation and no periodicity.
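As an illustrative stand-in for MATLAB's `armax` (not the report's code), the AR(1) coefficient can be estimated by a least-squares/Yule-Walker-style fit, and candidate models compared with the likelihood rule above:

```python
import numpy as np

def fit_ar1(x):
    """Least-squares estimate of an AR(1) coefficient on the
    mean-removed series, with the residual series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    phi = (x[1:] * x[:-1]).sum() / ((x[:-1] ** 2).sum())
    resid = x[1:] - phi * x[:-1]
    return phi, resid

def likelihood(resid, n_params, n):
    """Model-selection likelihood L_i = -(N/2) ln(sigma_i) - eta_i."""
    return -0.5 * n * np.log(resid.std(ddof=1)) - n_params
```

The model with the largest likelihood is selected, penalizing extra parameters in line with the principle of parsimony.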
SIMPLE LINEAR REGRESSION:
Using this technique we fit a linear equation between the dependent and independent variables with the help of the available data. The correlation between variables X and Y is given by:

$\gamma_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}$

where $x_i, y_i$ are the data points of the two series to be regressed, $\bar{x}, \bar{y}$ are the means of the respective series, and $s_x, s_y$ are their standard deviations.
Table 6: Values of avg. monthly precipitation, avg. monthly temperature & avg. monthly cloud cover
The correlation coefficients are given below:

                            Precipitation   Avg. monthly temperature   Avg. monthly cloud cover
Precipitation               1               0.2757                     0.9838
Avg. monthly temperature    0.2757          1                          0.3832
Avg. monthly cloud cover    0.9838          0.3832                     1

Table 7: Correlation coefficients between given variables
From the above table it is clear that precipitation and average monthly cloud cover have a very strong
correlation while precipitation and average monthly temperature do not have as strong a correlation. So,
using simple linear regression, we try to fit a relation between precipitation and average monthly cloud
cover.
The equation of a straight line is:
y = a + bx
The predicted value of y, denoted y′, is:
y′ = a + bx_i
Using the least-squares method, the coefficients a and b are calculated as:

$b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = 0.177376601$

$a = \bar{y} - b\bar{x} = 25.39079848$

where $x_i$ are the observations of x (precipitation), $y_i$ the observations of y (average monthly cloud cover), $\bar{x}$ the mean precipitation and $\bar{y}$ the mean cloud cover.
So the regression equation between precipitation (x) and average monthly cloud cover (y) is:
y = 0.177376601x + 25.39079848
As average monthly cloud cover and average monthly temperature are themselves correlated, multiple linear regression of precipitation on both variables is not used, to avoid multicollinearity.
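The least-squares fit above can be sketched in a few lines (a Python illustration; applied to the report's precipitation/cloud-cover data it would yield a ≈ 25.39 and b ≈ 0.1774):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit of y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    return a, b
```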
-------------------X---------------------