The theory of infinite probabilistic databases

Lindner, Peter; Grohe, Martin (Thesis advisor); Kimelfeld, Benny (Thesis advisor)

Aachen : RWTH Aachen University (2021)
Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2021


Probabilistic (relational) databases extend the relational model of databases by a concept of uncertainty that is expressed in terms of probability distributions. Such an extension is reasonable and useful for a variety of application scenarios in which there is either inherent uncertainty about the concrete shape of data, or which profit from the introduction of additional uncertainty. The standard model of probabilistic databases is based on a "possible worlds"-approach, that associates every possible manifestation of the data with probabilities. Although databases, logic, and stochastic models are usually defined over infinite universes, for example the real numbers, the conventional model only covers discrete probability spaces over finitely many facts.In this work, we extend the formal "possible worlds"-approach of probabilistic databases to the (uncountable) infinite setting. This introduces a number of new, non-trivial problems and questions, first and foremost the construction of suitable probability spaces and the well-definedness of query semantics. Our model of standard probabilistic databases that we develop for this purpose is based on the probability-theoretic notion of finite point processes, and we show that it indeed has the desired properties. In addition to the plain "possible worlds"-approach, various independence assumptions play a major role in finite probabilistic databases. Therefore, we investigate such assumptions also for infinite probabilistic databases. We characterize the existence of probabilistic databases with such properties and introduce new notions for independence assumptions in different new settings. This is based, again, on notions from point process theory. With respect to independence assumptions we present first studies regarding representations, representability and query answering. Finally, we discuss a particular probabilistic language—generative Datalog, which has been developed by Bárány, ten Cate, Kimelfeld, Olteanu and Vagena in order to conjoin probabilistic programming with the declarative nature of database query languages. This language is both suitable as a query language, and for representing infinite probabilistic databases. We embed the language into our framework of standard probabilistic databases and equip it with an improved semantics that also supports continuous distributions and probabilistic inputs. With our results and concepts, we establish a theory of infinite probabilistic databases and make an important contribution towards the understanding of the mathematical background of probabilistic relational databases in general.