ENBIS-14 in Linz

21 – 25 September 2014; Johannes Kepler University, Linz, Austria
Abstract submission: 23 January – 22 June 2014

The following abstracts have been accepted for this event:

  • Probabilistic Clustering of Panel Time Series Using a Time-Inhomogeneous Model Built Around Markov Chains

    Authors: Stefan Pittner (Vienna University of Economics and Business), Sylvia Frühwirth-Schnatter (Vienna University of Economics and Business)
    Primary area of focus / application: Mining
    Secondary area of focus / application: Modelling
    Keywords: Model-based clustering, Discrete time series, Bayesian inference, Markov chain Monte Carlo, Gibbs sampler, Markov mixture models, Applications in economics
    Submitted at 12-Jun-2014 21:34 by Stefan Pittner
    23-Sep-2014 14:20 Probabilistic Clustering of Panel Time Series Using a Time-Inhomogeneous Model Built Around Markov Chains
    Markov chains are well-suited for time series of a categorical nature. In fact, they are one of the simplest models for describing such time series with serial dependence.

    In this work, each of H clusters is represented by a separate Markov chain. The model M underlying our clustering procedure assumes that a time series belongs to one of the H clusters if and only if it has been generated by the Markov chain of that cluster.

    Two different time series generated by the same Markov chain are generally relatively far apart in the standard metrics (e.g., the Euclidean distance). Our model-based clustering approach therefore contrasts with the more common concept of distance-based clustering (such as k-means clustering): it targets the overall state-changing behavior of a time series rather than its functional form.

    The presentation starts with model M = M_1, which assigns each generated time series to a cluster at random according to a discrete distribution eta_1, eta_2, ..., eta_H over the H clusters.

    The main model M = M_2 extends model M_1 in two ways. (a) It allows each time series' cluster to depend on a vector of static covariates, which has to be associated with each time series. This is achieved by expressing each of the probabilities eta_h in model M_1 as a multinomial logit, so that any covariate can be either discrete or continuous. (b) Instead of representing a cluster by a single Markov chain for the whole time period, we provide a separate Markov chain for each subperiod of an equidistant partition within each cluster. This feature helps the model adapt to behavior changes over the course of a time series caused by exogenous circumstances.

    Two additional models, M = M_1a and M = M_2a, arise from models M_1 and M_2, respectively, by letting the Markov chain of each cluster depend on discrete covariates with a finite number of levels. In model M_2a these covariates can be equal to, partially equal to, or different from the covariate vector of item (a) above. Moreover, in model M_2a these covariates can be dynamic: they can vary with each subperiod of item (b).

    The parameters of each cluster model are estimated through Bayesian inference using an MCMC (Markov chain Monte Carlo) sampling scheme. The scheme always consists of a three-stage Gibbs sampler that involves quick draws from standard distributions. The structure of the MCMC method does not change when moving from model M_1 to M_1a or from model M_2 to M_2a.

    The use of these kinds of models is briefly demonstrated for an application in econometrics (employment status after bankruptcy) and a marketing application (customer purchase history of textiles).
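
    As a rough illustration of the idea behind model M_1, the sketch below draws a cluster from eta, simulates a categorical series from that cluster's Markov chain, and computes cluster membership probabilities for a series when eta and the transition matrices are treated as known. The function names, the two-state example values, and the uniform initial state are assumptions made for illustration; this is not the authors' estimation procedure, which uses a Gibbs sampler.

```python
import math
import random


def simulate_series(eta, transitions, length, rng):
    """Draw a cluster from eta, then simulate a series from that cluster's chain."""
    h = rng.choices(range(len(eta)), weights=eta)[0]
    P = transitions[h]
    n_states = len(P)
    state = rng.randrange(n_states)  # assumption: uniform initial state
    series = [state]
    for _ in range(length - 1):
        state = rng.choices(range(n_states), weights=P[state])[0]
        series.append(state)
    return h, series


def cluster_posterior(series, eta, transitions):
    """Cluster probabilities for one series, given known eta and transition matrices."""
    log_weights = []
    for eta_h, P in zip(eta, transitions):
        log_w = math.log(eta_h)
        for a, b in zip(series, series[1:]):
            log_w += math.log(P[a][b])  # log-likelihood of each observed transition
        log_weights.append(log_w)
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example with H = 2 clusters and 2 states.
eta = [0.6, 0.4]
transitions = [
    [[0.9, 0.1], [0.1, 0.9]],  # cluster 0: states tend to persist
    [[0.2, 0.8], [0.8, 0.2]],  # cluster 1: states tend to switch
]
rng = random.Random(0)
h, series = simulate_series(eta, transitions, 50, rng)
post = cluster_posterior(series, eta, transitions)
```

    Two series from the same cluster can look very different point by point, yet both receive a high probability for that cluster because only their transition behavior enters the calculation.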
  • Bayesian Modeling of Time Series of Counts

    Authors: Refik Soyer (Department of Decision Sciences - The George Washington University)
    Primary area of focus / application: Finance
    Secondary area of focus / application: Modelling
    Keywords: Time-Series analysis, State-Space, Environmental models, Poisson and negative-binomial time-series, Bayesian inference, Markov chain Monte Carlo
    Submitted at 16-Jun-2014 11:58 by Marco P. Seabra dos Reis
    22-Sep-2014 11:25 Bayesian Modeling of Time Series of Counts
    In this talk we consider modeling time-series of correlated counts, which often arise in finance, operations and marketing applications. We discuss both univariate and multivariate time-series of correlated count data and introduce state-space and common environment models. In so doing, we consider correlated Poisson and negative-binomial time-series. We discuss both parameter-driven and observation-driven correlation structures and develop Bayesian inference for these models using Markov chain Monte Carlo methods. We introduce multivariate extensions and present applications such as modeling of call center arrivals, shopping trips and mortgage defaults.
    Joint work with Tevfik Aktekin (University of New Hampshire) and Bumsoo Kim (Sogang University, Seoul, Korea)
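
    The distinction between parameter-driven and observation-driven correlation can be sketched with two generic Poisson count models. These are textbook-style stand-ins chosen for illustration, not the specific models of the talk: the first lets a latent AR(1) process drive the log-intensity, the second lets the intensity depend on the previous count. All function names and parameter values are assumptions.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for moderate lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_parameter_driven(n, mu, phi, sigma, rng):
    """Counts whose log-intensity mu + theta_t follows a latent AR(1) state theta_t."""
    theta = 0.0
    counts = []
    for _ in range(n):
        theta = phi * theta + rng.gauss(0.0, sigma)  # latent state evolution
        counts.append(poisson_draw(math.exp(mu + theta), rng))
    return counts


def simulate_observation_driven(n, omega, alpha, beta, rng):
    """Counts whose intensity is updated from the previous count and intensity."""
    lam = omega / (1.0 - alpha - beta)  # start at the stationary mean
    y = poisson_draw(lam, rng)
    counts = [y]
    for _ in range(n - 1):
        lam = omega + alpha * y + beta * lam  # feedback from the observed count
        y = poisson_draw(lam, rng)
        counts.append(y)
    return counts


rng = random.Random(1)
y_param = simulate_parameter_driven(100, mu=math.log(5.0), phi=0.8, sigma=0.3, rng=rng)
y_obs = simulate_observation_driven(100, omega=1.0, alpha=0.3, beta=0.5, rng=rng)
```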
  • Measuring the Effectiveness of Process Improvement in a Non-Randomized Experiment

    Authors: Susana Vegas (Universidad de Piura), Valeria Quevedo (Universidad de Piura)
    Primary area of focus / application: Quality
    Secondary area of focus / application: Process
    Keywords: Process improvement, Regression discontinuity, Non-randomized experiment, Effectiveness measurement
    Submitted at 16-Jun-2014 18:02 by Susana Vegas
    23-Sep-2014 10:15 Measuring the Effectiveness of Process Improvement in a Non-Randomized Experiment
    In a continuous improvement process, a key issue is to “Check” the effectiveness of an implemented treatment. Without this, it is not possible to perform the “Action” correctly. Designing good experiments and performing adequate statistical tests are important to verify that the improvement is due to the treatment.
    In some cases, operating conditions prevent a randomized experiment from being carried out. To overcome this situation, this paper addresses the use of the Regression Discontinuity (RD) approach as an alternative procedure to estimate the effect of an experiment. RD can be used when treatment is assigned using a known “threshold” of an explanatory variable.
    An application to verify the effectiveness of a remedial program for undergraduate students is shown. It was found that the remedial program has a marginal effect of 32.8 percentage points on the students’ expected value for passing the first semester. The findings suggest that RD can effectively be used to control the performance under the conditions stated above.
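
    A minimal sketch of a sharp RD estimate is given below, assuming that units scoring below the threshold receive the treatment and that the outcome is linear in the score on each side. The function names and the synthetic data are hypothetical and unrelated to the study's data; the estimate is the gap between the two fitted lines at the threshold.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_effect(scores, outcomes, threshold):
    """Sharp RD estimate; assumes units with score < threshold are treated."""
    # Center the score so that each intercept is the fitted value at the threshold.
    treated = [(s - threshold, y) for s, y in zip(scores, outcomes) if s < threshold]
    control = [(s - threshold, y) for s, y in zip(scores, outcomes) if s >= threshold]
    intercept_treated, _ = fit_line(*zip(*treated))
    intercept_control, _ = fit_line(*zip(*control))
    return intercept_treated - intercept_control


# Synthetic, noise-free data with a built-in jump of 0.3 at the threshold.
scores = list(range(100))
outcomes = [0.2 + 0.01 * s + (0.3 if s < 50 else 0.0) for s in scores]
effect = rd_effect(scores, outcomes, threshold=50)
```

    In practice the fit is usually restricted to observations near the threshold, and the linearity assumption should be checked.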
  • Mixture Models for Text Mining in R

    Authors: Bettina Grün (Johannes Kepler Universität Linz)
    Primary area of focus / application: Mining
    Keywords: Bag-of-words model, Finite mixture model, Text mining, Topic model, R
    Submitted at 17-Jun-2014 19:11 by Bettina Grün
    Accepted
    23-Sep-2014 15:00 Mixture Models for Text Mining in R
    The ready availability of large electronic document collections increases the need for automatic statistical tools that extract meaningful information from natural language text. These statistical methods often rely on the bag-of-words assumption: the information on how often terms occur in a document is assumed to be sufficient for analyzing the documents, and the order in which words occur is ignored. The data used for analysis is therefore a document-term matrix, generated from the texts, that contains the observed term frequencies of the documents.

    Among these bag-of-words models two different mixture models have been proposed: finite mixtures of von Mises-Fisher distributions and the latent Dirichlet allocation topic model. Finite mixtures of von Mises-Fisher distributions are fitted based on the assumptions that each document belongs to only one cluster and that only the directional information in the data is of importance. The latent Dirichlet allocation topic model is a generative model for the term frequencies in a document which aims at capturing the observed dependencies between them. Each document is assumed to be a mixture of several topics and each topic is characterized by its own term distribution.

    We give an introduction to these models and their estimation. Furthermore, we present the R packages movMF and topicmodels, which allow these models to be fitted. Both packages build on and extend functionality from the text mining package tm. The functionality provided by the packages is outlined and their application is illustrated.
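
    The bag-of-words representation described above can be sketched in a few lines. The example below builds a small document-term matrix and rescales each row to unit length, which keeps only the directional information that von Mises-Fisher mixtures work with. It is written in Python for illustration and does not use or reproduce the API of the R packages movMF, topicmodels or tm; the whitespace tokenizer and the function names are assumptions.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def to_unit_length(row):
    """Scale a count vector to Euclidean length 1, keeping only its direction."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


docs = ["the cat sat", "the dog sat", "cat cat dog"]
vocab, dtm = document_term_matrix(docs)
directions = [to_unit_length(row) for row in dtm]
```

    Reordering the words of a document leaves its row unchanged, which is exactly the information the bag-of-words assumption discards.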
  • Multivariate Data Analysis in the Big Data Era

    Authors: Alberto Ferrer (Universidad Politecnica de Valencia)
    Primary area of focus / application: Modelling
    Secondary area of focus / application: Process
    Keywords: Big data, Multivariate data analysis, Latent structures, Data analytics
    Submitted at 18-Jun-2014 13:33 by Alberto J. Ferrer-Riquelme
    24-Sep-2014 09:40 Multivariate Data Analysis in the Big Data Era
    “Big Data” is a fashionable expression nowadays. It is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. According to Wikipedia, it is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big data is linked to a multi-V system: Volume, Velocity, Variety, Veracity. But big data is not a synonym for success; the problem is how to get value from the data. In this talk we illustrate the potential of latent structures-based multivariate statistical techniques for smart business decision-making in industrial settings in the big data era.
  • Using Informative Missing to Build Models That Make Better Predictions

    Authors: Volker Kraft (JMP), Ian Cox (JMP)
    Primary area of focus / application: Modelling
    Keywords: Data quality, Informative missing, Prediction, Modelling, JMP Pro
    Submitted at 18-Jun-2014 18:21 by Volker Kraft
    22-Sep-2014 15:40 Using Informative Missing to Build Models that Make Better Predictions
    --- Note: Intended for Software Session --- As the number of variables in data increases, a naive strategy of row-wise deletion when any cell is missing becomes less and less desirable because very soon you may have no rows left to analyse. Informative missing is a simple but very effective coding system that allows the estimation of a predictive model despite the presence of missing values. It can code both continuous and categorical model effects. Using a series of examples in JMP Pro, this presentation shows the value of this approach when building predictive models using regression, neural networks and tree-based methods.
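
    One common way to implement such a coding is sketched below, under the assumption that a continuous column is mean-imputed and paired with a 0/1 missingness indicator, while a categorical column treats missing as its own level. The function names are hypothetical and the code is a plain Python illustration, not JMP Pro's implementation.

```python
def code_continuous(values):
    """Return (imputed values, missing indicator); None marks a missing cell."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


def code_categorical(values, missing_label="Missing"):
    """Treat a missing cell as a separate category level."""
    return [missing_label if v is None else v for v in values]


temperature = [1.0, None, 3.0]
supplier = ["A", None, "B"]
temp_imputed, temp_missing = code_continuous(temperature)
supplier_coded = code_categorical(supplier)
```

    Every row is kept, and a model that includes both the imputed column and the indicator can learn whether missingness itself carries predictive information.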