PROBABILITY DISTRIBUTIONS ASSESSMENT FOR MODELING GAS CONCENTRATION IN CAMPO GRANDE, MS, BRAZIL

The predominant air pollutants in cities are NOx (NO + NO2), O3 and OX (O3 + NO2). This research focused on pollutant variables that damage human health as well as the environment. Seven statistical models {Weibull (W), Gamma (G), Lognormal (L), Frechet (Fr), Burr (Bur), Rayleigh (R) and Rician (Ri)} were chosen to fit the observations of the air pollutants. Hourly average data for one year (2015) were considered. In addition, performance indicators {Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE)} were applied to assess the goodness of fit of the frequency distributions. The distribution that best fit the observations of the variables was the Rician distribution, and the log-normal distribution for COD. The probabilities of concentration exceedances were calculated (predicted) from the cumulative distribution function (cdf) obtained from the best-fit distributions.


INTRODUCTION
Air pollution in urban areas has adverse effects on human health and the environment. In addition, cities face increasing urban pollution, which has negative effects on their rapidly growing populations. Recent studies have also shown that industrialization and the use of motor vehicles are the two main contributors to urban air pollution.
One of the main problems caused by air pollution in urban areas is the presence of photochemical oxidants. Among these pollutants, ozone (O3) and nitrogen dioxide (NO2) are particularly important because they can provoke adverse effects on human health (OMS, 2000). The formation of ozone at ground level depends on the intensity of the solar radiation, the absolute concentrations of NOx and VOCs (Volatile Organic Compounds), and the ratio between NOx and VOCs. 1 The ozone concentration increases with growing solar radiation intensity and air temperature. [4][5][6][7][8] In Campo Grande, some studies and climate monitoring campaigns have been carried out 2,[9][10][11][12][13][14][15][16][17] to study atmospheric dispersion modelling and to explore the results of climate change.
In the literature, probability distributions have been used to fit the concentrations of air pollutants, including the Weibull distribution, 18 the Lognormal distribution, 19 the Gamma distribution, 20 the Rayleigh distribution, 21 the Gumbel distribution 22 and the Frechet distribution. 23 These fits have been assessed using a variety of performance indicators, such as the mean absolute error (MAE), root mean square error (RMSE), concordance index (d2), bias, normalized absolute error (NAE), prediction accuracy and the coefficient of determination (R2).
The objectives of this study are to adjust the probability distributions for the concentration of three air pollutants (NOx, O3 and OX) using seven statistical models.

Study area and observational data
Campo Grande is the capital city of the state of Mato Grosso do Sul (MS), located in the south of Brazil's Midwest region, in the center of the state. Geographically, the city is near the Brazilian border with Paraguay and Bolivia. It is located at 20°26'34'' South latitude and 54°38'47'' West longitude. Figure 1 shows the location of Campo Grande, capital of the state of Mato Grosso do Sul (MS). It occupies a total area of 8,096.051 km² or 3,126 mi², representing 2.26 % of the total state area, with 860,000 inhabitants (2016) and a corresponding HDI of 0.78. The urban area is approximately 154.45 km² or 60 mi². A tropical climate with a dry season predominates, with two clearly defined seasons: warm and humid in the summer, and less rainy with mild temperatures in the winter months.
During the winter months, the temperature can drop considerably, on certain occasions reaching a thermal sensation of 0 ºC or 32 ºF, with occasional light frost. The average annual precipitation is usually about 1,534 mm, with small upward or downward variations. The main pollution problems in the city are attributed to vehicle traffic, the rise in building activities, the presence of dumping grounds, the use of small oil-fired generators to supply power to the electric grid and, finally, fires set deliberately to clear local land.

Ensemble of observational data
The air quality and meteorological variables are monitored by an automatic station operated at the Institute of Physics of the Federal University of Mato Grosso do Sul (UFMS). This meteorological station is located inside the university campus, about 8 km or 5 miles to the west of downtown. The main sources of pollution in that area are the building activities; no significant sources of ozone precursors have been identified close to the region. The ozone levels of the Campo Grande area have been stored in a regular database since 2004.
The measurement equipment was installed at the top of a tower, from where air samples are extracted through vertical pipes that are placed approximately 2 meters above ground level.
The equipment used for the measurements includes a nitrogen oxide analyzer (AC31M, using the chemiluminescence method) and an ozone analyzer (O341M, LCD/UV photometry). All equipment was made by Environnement S.A.

Modelling of the climatological datasets
The statistical models (Weibull, Rayleigh, Gamma, Lognormal, Frechet, Burr and Rician) used for fitting the observed datasets (NOx, OX and O3) are defined as follows:

Weibull (W) PDF
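The two-parameter Weibull probability density function (pdf) is the derivative of the cumulative distribution function (cdf) and is expressed in Eqn. (1):
$$f_w(v; k, C) = \frac{k}{C}\left(\frac{v}{C}\right)^{k-1} \exp\left[-\left(\frac{v}{C}\right)^{k}\right] \qquad (1)$$
The Weibull cumulative distribution function (cdf) is given by Eqn. (2):
$$F_w(v; k, C) = 1 - \exp\left[-\left(\frac{v}{C}\right)^{k}\right] \qquad (2)$$
where k and C are the shape and scale parameters of the Weibull distribution derived from the time series of the climatological datasets, and v is the time series of observations of each variable/dataset. The shape parameter k is obtained from the maximum likelihood estimator (MLE):
$$k = \left[\frac{\sum_{i=1}^{N} v_i^{k} \ln v_i}{\sum_{i=1}^{N} v_i^{k}} - \frac{1}{N}\sum_{i=1}^{N} \ln v_i\right]^{-1} \qquad (3)$$
Once k has been calculated, the scale parameter is obtained from Eqn. (4):
$$C = \left(\frac{1}{N}\sum_{i=1}^{N} v_i^{k}\right)^{1/k} \qquad (4)$$
where N is the number of points in the time series. Eqn. (3) is applied to each set of climate observations and solved iteratively, starting from an initial guess of k = 2, until the value of k converges.

Rayleigh (R) PDF
With the shape parameter fixed at k = 2 in Eqns. (1) and (2), the Rayleigh pdf of a continuous distribution, f_r(v, k, C), is given as:
$$f_r(v; k = 2, C) = \frac{2v}{C^{2}} \exp\left[-\left(\frac{v}{C}\right)^{2}\right] \qquad (5)$$
The Rayleigh cumulative distribution function (cdf) is given as:
$$F_r(v; k = 2, C) = 1 - \exp\left[-\left(\frac{v}{C}\right)^{2}\right] \qquad (6)$$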
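The iterative solution of the Weibull shape parameter described above (MLE equation for k, initial guess k = 2, then the scale parameter C from the converged k) can be illustrated with a minimal Python sketch. It assumes strictly positive concentrations and uses a plain fixed-point iteration; the function name `fit_weibull_mle`, the tolerance and the iteration cap are illustrative choices, not taken from the original study.

```python
import math

def fit_weibull_mle(values, k0=2.0, tol=1e-8, max_iter=500):
    """Estimate the Weibull shape k and scale C by fixed-point iteration.

    Assumption: non-positive values are dropped because ln(v) is undefined.
    """
    data = [v for v in values if v > 0]
    n = len(data)
    logs = [math.log(v) for v in data]
    mean_log = sum(logs) / n

    k = k0  # initial guess k = 2, as described in the text
    for _ in range(max_iter):
        powers = [v ** k for v in data]
        # Ratio of sum(v^k * ln v) to sum(v^k)
        weighted_log = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
        k_new = 1.0 / (weighted_log - mean_log)
        converged = abs(k_new - k) < tol
        k = k_new
        if converged:
            break

    # Scale parameter from the converged shape parameter
    c = (sum(v ** k for v in data) / n) ** (1.0 / k)
    return k, c
```

In practice, a call such as `k, c = fit_weibull_mle(hourly_o3)` (with `hourly_o3` a hypothetical list of hourly concentrations) returns both parameters; a library fit such as SciPy's `weibull_min.fit` would be a reasonable cross-check.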

Gamma (G) PDF
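The pdf of a gamma distribution is defined by Olaofe and Folly 24 as:
$$f_g(v; k, C) = \frac{v^{k-1}}{C^{k}\,\Gamma(k)} \exp\left(-\frac{v}{C}\right) \qquad (7)$$
where f_g and Γ(k) are the pdf of the gamma distribution and the gamma function of k, respectively; k and C are the shape and scale parameters of the gamma distribution derived from the time series observations. The cumulative distribution function (cdf) of a gamma distribution is defined as:
$$F_g(v; k, C) = \int_{0}^{v} \frac{t^{k-1}}{C^{k}\,\Gamma(k)} \exp\left(-\frac{t}{C}\right) dt \qquad (8)$$
where F_g is the cumulative distribution function of the gamma distribution.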

Lognormal (L) PDF
The lognormal distribution was used to fit the ozone concentration data. The location parameter of the lognormal distribution is estimated from the expression: (9) where the variance is that of the observed dataset and μ is the lognormal scale (sigma) parameter. The scale parameter of the lognormal distribution is estimated as (10) in terms of the location parameter.
The lognormal pdf, fl, and cdf, Fl, are defined in terms of the location parameter, the scale (sigma) parameter μ and the complementary error function, erfc.
Elsewhere in the literature, 25 the probability density function of the lognormal distribution was given by Lu.

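Frechet (Fr) PDF
The density function of the generalized extreme value (GEV) distribution with shape (k ≠ 0), location (µ) and scale (δ) parameters is given by: 26
$$f_f(v; k, \mu, \delta) = \frac{1}{\delta} \exp\left[-\left(1 + k\,\frac{v - \mu}{\delta}\right)^{-1/k}\right] \left(1 + k\,\frac{v - \mu}{\delta}\right)^{-1 - 1/k} \qquad (13)$$
for 1 + k(v − µ)/δ > 0, where f_f is the probability density function of a Frechet (GEV) distribution.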

Rician (Ri) PDF
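The density function of a Rician distribution is given as: 27
$$f_{ri}(v; s, \delta) = \frac{v}{\delta^{2}} \exp\left[-\frac{v^{2} + s^{2}}{2\delta^{2}}\right] I_0\left(\frac{v s}{\delta^{2}}\right) \qquad (14)$$
where s ≥ 0 and δ > 0 are the non-centrality and scale parameters, respectively, and I_0 is the zero-order modified Bessel function of the first kind.
The two parameters of the Rician distribution are estimated from the maximum likelihood conditions:
$$s = \frac{1}{N}\sum_{i=1}^{N} v_i\,\frac{I_1(z_i)}{I_0(z_i)} \qquad (15)$$
$$\delta^{2} = \frac{1}{2}\left[\frac{1}{N}\sum_{i=1}^{N} v_i^{2} - s^{2}\right] \qquad (16)$$
where I_1(z) is the first-order modified Bessel function of the first kind and z_i = v_i s/δ². A good numerical optimization algorithm with a starting value is needed to solve Eqn. (15).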
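Because the Rician parameter estimates require a numerical solver with a starting value, a minimal Python sketch of one possible approach is given here. It is not the procedure used in the study: it assumes the standard Rician pdf, evaluates the Bessel functions I0 and I1 from their power series, and uses an expectation-maximization style fixed-point iteration started from the sample mean and variance. All function names, the starting values and the synthetic-data helper `simulate_rician` are hypothetical.

```python
import math
import random

def bessel_i0_i1(x):
    """Return (I0(x), I1(x)) from their power series (moderate x only, roughly x < 700)."""
    q = (x / 2.0) ** 2
    term0, term1 = 1.0, x / 2.0
    i0, i1 = term0, term1
    m = 1
    while term0 > 1e-17 * i0:
        term0 *= q / (m * m)
        term1 *= q / (m * (m + 1))
        i0 += term0
        i1 += term1
        m += 1
    return i0, i1

def rician_pdf(v, s, delta):
    """Rician density with non-centrality s >= 0 and scale delta > 0."""
    if v < 0:
        return 0.0
    i0, _ = bessel_i0_i1(v * s / delta ** 2)
    return (v / delta ** 2) * math.exp(-(v * v + s * s) / (2.0 * delta ** 2)) * i0

def fit_rician(values, tol=1e-7, max_iter=500):
    """Fixed-point (EM-style) iteration on the maximum likelihood conditions."""
    n = len(values)
    mean_sq = sum(v * v for v in values) / n
    s = sum(values) / n                    # assumed starting value: sample mean
    var = max(mean_sq - s * s, 1e-12)      # assumed starting value: sample variance
    for _ in range(max_iter):
        total = 0.0
        for v in values:
            i0, i1 = bessel_i0_i1(v * s / var)
            total += v * i1 / i0
        s_new = total / n
        var_new = max(0.5 * (mean_sq - s_new * s_new), 1e-12)
        converged = abs(s_new - s) < tol and abs(var_new - var) < tol
        s, var = s_new, var_new
        if converged:
            break
    return s, math.sqrt(var)

def simulate_rician(n, s, delta, seed=0):
    """Hypothetical helper: synthetic Rician data for checking the fit."""
    rng = random.Random(seed)
    return [math.hypot(s + delta * rng.gauss(0, 1), delta * rng.gauss(0, 1))
            for _ in range(n)]
```

For example, `fit_rician(simulate_rician(1500, 30.0, 10.0))` should return values close to s = 30 and δ = 10. For real data, `scipy.stats.rice` offers an alternative that avoids the overflow limits of the series evaluation.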

Burr (Bur) PDF
The density function (pdf) of the Burr distribution is given by the expression: (17)

Accuracy Test
The accuracy results are essential for determining the effectiveness of the statistical models. Thus, an accuracy check is carried out by comparing the observed climate distribution with the predicted/modeled distributions. The observed data are the values from the monitoring systems, whereas the modeled datasets are obtained from the fitted distributions. 26 The various tests for determining the goodness-of-fit of the models (pdfs) are expressed below:

Mean Absolute Error (MAE)
The mean absolute error is used for testing the predicted distribution of the climatological variables (NOx, OX and O3) against the observed distribution. It is defined as the mean of the absolute errors between the observed and predicted values:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| x_i - \hat{x}_i \right| \qquad (18)$$
where x_i are the observed values of the air pollutants, \hat{x}_i are the predicted/modeled values from the Weibull, Rayleigh, Gamma, Lognormal and other models, and N is the number of observations.

Root Mean Square Error (RMSE)
The root mean square error is used to compare the predicted values with the observed values. For the best-fit statistical model it is given as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^{2}} \qquad (19)$$

Mean Absolute Percentage Error (MAPE)
The mean absolute percentage error is calculated as:
$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \qquad (20)$$
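The three performance indicators above can be computed with a few lines of Python. The sketch below assumes the usual textbook definitions (means over N paired observed and predicted values, MAPE expressed in percent) and that no observed value is zero; the function names are illustrative.

```python
import math

def mae(observed, predicted):
    """Mean absolute error between paired observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error between paired observed and predicted values."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def mape(observed, predicted):
    """Mean absolute percentage error in percent (observed values must be nonzero)."""
    return 100.0 * sum(
        abs((o - p) / o) for o, p in zip(observed, predicted)
    ) / len(observed)
```

For observed values [10, 20, 40] and predicted values [12, 18, 40], these functions give MAE ≈ 1.33, RMSE ≈ 1.63 and MAPE = 10 %.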

RESULTS AND DISCUSSIONS
The descriptive statistics of the average values of the air pollutants for the sampling period (2015) are shown in Table 1. The annual mean values of the gases (NO, NO2, NOx, OX and O3) were higher than the medians, indicating that high concentrations were recorded during the studied period. Most of the data are concentrated on the left of the PDF charts, with few high values. There was an increase in mean, median, roughness and persuasion values, indicating a growing problem of air pollution in Campo Grande.

Hourly variation of O3, NO, NO2, OX and NOx concentrations
The average diurnal variation observed for the NO, NO2, NOx, OX and O3 concentrations is presented in Fig. 2. Generally, the daily cycle of the ozone concentration reaches its peak at midday and shows smaller concentrations during the night. The ozone concentration slowly increases after sunrise, reaches its maximum value during the daylight period, and then decreases slowly until the next morning.
Figure 2 shows a displacement of about 2 hours in the morning between the NO and NO2 peaks. In the morning, NO2 is produced by oxidation of NO, 2 because NO can be converted to NO2 in the presence of peroxy radicals. In the evening, NO and NO2 concentrations increase slightly because of increased vehicular traffic during the rush hour (6:00 p.m.) and the influence of the stability of the nocturnal boundary layer. NO2 reached its peak at 6:00 p.m.
Figure 2 shows an increase in O3 concentrations during the day, starting at 8:00 a.m. and peaking at 2:00 p.m. NO is converted to NO2 by reaction with O3, but during the daytime NO2 is converted back to NO by photolysis, which leads to O3 regeneration. 8 O3 concentrations in urban atmospheres peak during the daytime at 14:00-15:00, when solar radiation intensity and air temperature are at their maximum. This increase is caused by the photolysis of NO2 and by the growth of the boundary layer height during the daytime, which can result in O3 from air at higher altitudes being mixed down to the surface through thermal stratification and convective heat transfer. After reaching the maximum concentration at 14:00-15:00, the concentration of O3 decreases because the photochemical activity decreases.
Higher OX concentrations occurred in the afternoon, revealing the influence of photochemical processes. 5,8 Also, OX decreases at night because of the absence of solar radiation. This lack of radiation hinders the formation of NO2 and O3 by photolytic reactions, as well as the reactions of NO2 with NO3 and of NO2 with O3. 28 O3 and a large percentage of NO2 are secondary contaminants, formed through a complex set of chemical reactions, whereas NO is a primary contaminant. At 07:00 a.m., sunlight begins to induce a series of photochemical reactions. NO is converted into NO2 through a reaction with O3. During daylight hours, NO2 is converted back into NO by photolysis, which induces the regeneration of O3.
Another factor influencing the atmospheric air pollutant concentrations is the height of the mixing layer over the city. On a sunny day, the pollutants are diluted as the mixing layer grows during the day, and they stay confined within the nocturnal planetary boundary layer (NPBL) during the night. Emitted pollutants, like NO, are trapped underneath such an inversion, which can cause an increase in the hourly average concentration of NOx overnight. The basic chemistry that leads to the production and destruction of ozone has been detailed elsewhere.
The most used goodness-of-fit measures are the coefficient of determination, the Chi-square (χ2) test, the Kolmogorov-Smirnov (KS) test and the root mean square error (RMSE). In most studies, a visual evaluation of the fitted pdfs overlaid on the histograms of the data is also performed. The RMSE is applied to the theoretical cumulative probabilities against the empirical cumulative probabilities of the concentrations of the observed variables. These statistics are also calculated with the variable data in the form of frequency histograms.

Probability distributions assessment for gas concentration modelling
In addition to the analysis performed on the distributions of the variables, some authors have also evaluated the adequacy of pdfs for fitting the concentration distributions obtained from the sampled variables or for predicting the concentrations. In this case, the pdfs are first fitted to the data of the variables. Then, the theoretical concentration density distributions are derived from the fitted pdfs. Finally, the goodness-of-fit measures are calculated using the theoretical density distributions and the distributions estimated from the sampled NO, O3 and OX variables.
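The procedure described above (fit each pdf, derive its theoretical distribution, then compare it with the distribution estimated from the sample) can be sketched in Python as a comparison between the empirical cumulative distribution and each candidate cdf. This is a minimal sketch under stated assumptions: the empirical cdf uses the plotting position i/N, the score is the RMSE between empirical and theoretical cumulative probabilities, and the function names and the `candidates` dictionary are hypothetical.

```python
import math

def empirical_cdf(values):
    """Return the sorted values and their empirical cumulative probabilities i/N."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]

def cdf_rmse(values, model_cdf):
    """RMSE between empirical and theoretical cumulative probabilities."""
    xs, probs = empirical_cdf(values)
    squared = sum((p - model_cdf(x)) ** 2 for x, p in zip(xs, probs))
    return math.sqrt(squared / len(xs))

def rank_models(values, candidates):
    """Rank fitted cdfs (name -> callable) from smallest to largest RMSE."""
    scores = {name: cdf_rmse(values, cdf) for name, cdf in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Each entry of `candidates` would be a cdf with its fitted parameters already bound, for example `{"Weibull": lambda x: 1 - math.exp(-(x / c) ** k)}` with k and C taken from the fit; the first element of the returned list is the best-scoring model under this criterion.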
Figure 3 shows the seven PDFs, namely Weibull (W), gamma (G), log-normal (L), Frechet (Fr), Burr (Bur), Rayleigh (R) and Rician (Ri), fitted to the variables studied in the data set. Graphically, it can be seen that the Rician PDF produces the best fit. The Rayleigh and gamma distributions match the histogram to a lesser extent and provide the poorest fits. It can also be seen from the figures that the variables have histograms of different shapes. The parameter values obtained for these distributions and the goodness of fit based on the performance index criteria are presented in Table 2. The statistical indicators gave similar results in all cases. The Weibull (W), Rician (Ri) and log-normal (L) functions provide the smallest fitting errors for the data sets, which is also confirmed by Figure 3. The statistical tests show that the Rician distribution is the best choice for the data set. However, the Weibull PDF also provides fairly accurate results for the variables. The Rayleigh PDF performs very poorly and is a poor fit. The performance of these three PDFs in evaluating the concentrations of the variables was also analyzed, and the results are summarized in Table 2.

The Rayleigh PDF produced the largest error among the PDFs and produced significant errors in the evaluation of the concentrations of the variables. Overall, the Weibull, Rician and lognormal PDFs resulted in fewer errors, and among these three functions the Rician was ranked first based on the performance index criteria. However, evaluating these distribution functions on the basis of goodness-of-fit criteria alone is not enough. These criteria should be used to identify appropriate distributions before a detailed analysis is made. Because the fitted PDFs can be used for different applications by industry and by public managers in decision-making, their performance for specific applications, such as predicting pollutant concentrations, should also be evaluated. The results show that the concentration density of the pollutants is in general both underestimated and overestimated, depending on the concentration range. The percentage errors mainly reflect this underestimation and overestimation of the pollutant concentrations, which may be due to the heating effect and the atmosphere.
The gamma distribution has also been used to fit the probability density functions of daily air pollutant concentrations. 29 The current study showed that the pollutants studied (O3, OX and NO) had different statistical distributions. The difference might be due to the different diffusion characteristics of the individual pollutants in the air, and to the interaction of these diffusion characteristics with the local geography and weather conditions in Campo Grande. The underlying mechanisms need to be further explored.
The current analysis shows that the best-performing statistical distributions differ among the air pollutants in Campo Grande. Results in the literature also vary. For example, Nan-Hung Hsieh and Chung-Min Liao claimed that the probability distributions of all air pollutants in Taiwan were approximately lognormal. 30 In addition, Neustadter 31 revealed that total suspended particulate is clearly lognormally distributed, whereas sulfur dioxide and nitrogen dioxide are reasonably approximated by lognormal distributions. However, Oguntunde 32 showed that the Gamma pdf is the best distribution model for carbon monoxide concentration modelling in Lagos State, Nigeria. Hai-Dong Kan and Bing-Heng Chen indicated that the best-fit distributions for PM10 concentrations in Shanghai were lognormal. 33 In Malaysia, Noor et al. 26 found that the distribution that best fit the PM10 observations in Nilai was the Gamma distribution, while the log-normal distribution was more appropriate in Shah Alam. Razali et al. reported the lognormal distribution as the best fit to the carbon monoxide data in Bangi, Malaysia. 34 Accordingly, there is no common distribution for air pollutants; it differs with the studied region and time. It is important to carry out a comparative analysis to find out which distribution best fits the air pollutants in a particular location, in order to provide a better estimate of the air quality at that location.
Table 2 presents the results of the goodness-of-fit tests for the different distributions fitted to the air pollutant data. The preferred results are highlighted in bold italics. Of the distributions considered, the Rician and Gamma distributions fit most of the air pollutant data in Campo Grande (NO, O3 and OX) well, while O3 is also fitted well by the Weibull distribution.

Performance Indicators
The values of the performance indicators for the concentrations of the variables in Campo Grande are tabulated in Table 2. A small MAE value indicates that the Rician distribution fits the sampled data of the variables (O3 and OX) well, while the fit considered appropriate is the Rayleigh function. The smaller MSE and RMSE values indicate that the Rician distribution best fits the data of the variables, and the lower COD value also indicates that the Rician distribution fits the data of the variables.

CONCLUSION
Based on the statistical characteristics of the air pollutant concentrations studied in Campo Grande, the findings indicate that the mean concentrations of the variables in the monitoring data sets were higher than the medians, showing that all the distributions are positively skewed (to the right), with few extreme concentrations. The Weibull (W), gamma (G), log-normal (L), Frechet (Fr), Burr (Bur), Rayleigh (R) and Rician (Ri) distributions were analyzed with the selected datasets. Performance indicators, namely the mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE), were also applied to assess the goodness of fit of the distributions.
The distributions that best fit the observations of the variables were the Rician, Weibull and lognormal distributions. The pdf and cdf graphs obtained in this research can be used to predict the probabilities of exceedances.
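The use of a fitted cdf to predict exceedance probabilities can be illustrated with a short Python sketch. The exceedance probability of a threshold x is 1 − F(x). The Weibull cdf is used here only because it has a simple closed form, not because it is the best-fitting model in this study, and the threshold and parameter values in the usage note are hypothetical placeholders rather than results from Table 2.

```python
import math

def weibull_exceedance(threshold, k, c):
    """Probability that the concentration exceeds `threshold`: 1 - F_w(threshold)."""
    if threshold <= 0:
        return 1.0
    return math.exp(-((threshold / c) ** k))

def expected_exceedance_hours(threshold, k, c, hours=8760):
    """Expected number of hourly exceedances in a period of `hours` hours."""
    return hours * weibull_exceedance(threshold, k, c)
```

For instance, with hypothetical parameters k = 2 and C = 40, `weibull_exceedance(40, 2, 40)` equals exp(−1), about 0.37, which corresponds to roughly 3,200 of the 8,760 hours in a year.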
This research shows the importance of statistical analysis in the field of atmospheric pollution for environmental engineering: it is useful for fitting pollutant data sets with the best statistical model and, in turn, for successfully estimating pollutant exceedances.
However, this work can still be improved by applying other types of distributions to fit the time series of air pollution monitoring data.

Figure 1. Location of the Municipality of Campo Grande in the State of Mato Grosso do Sul, and the continuous air monitoring station located on the campus of the Federal University of Mato Grosso do Sul, Campo Grande, MS.

Figure 2. Average of measured values for a daily period of NO, NO2, NOx, O3 and OX concentrations. The interval between measurements equals 1 hour.

Figure 3. Plots of the pdfs and cdfs for three air pollutant variables (NO, O3, OX) in Campo Grande.

Table 1. Descriptive analysis of pollutants for the sampling period (2015).

Table 2. Performance indicators for variables concentration in Campo Grande.