Brave New Data World
Francisco Louzada, ICMC-USP
Constant data generation has fundamentally changed several economic sectors and the very consumption of goods and services. We are in a new age, surrounded by data everywhere. Once, it was the time of mining, fishing, snooping, dredging... Now it is Big time. In this new world order of gigantic databanks, statistical science has been adapting in order to keep providing useful knowledge efficiently. Traditional strategies of analysis have been revised to take into account the magnitude of the information, contamination, missing data, variables that are not identically distributed, non-stationary data streams, sampling error and non-numerical variables. The need to develop alternative strategies is pressing. They should be based on sequential and adaptive methods of estimation, segmentation structures and combinations of multiple models, directly associated with efficient computing strategies that can respond in a timely manner. This presentation discusses the opportunities this brave new world offers to those who enter it and how we can contribute to the education of a new kind of professional who can work in it efficiently.
Beyond the infancy of Big Data Analysis: Learning from Istat experiences to grow up
Stefano De Francisci, Monica Scannapieco and Giulio Barcaroli, Istat, Italy
Big Data are becoming increasingly important as additional data sources for Official Statistics. After a period of experimental analyses, focused on investigating the potential that the variety, velocity and volume of Big Data can have in several statistical domains, the use of specific methods, technologies and processes that bring new Big Data sources into statistical surveys (both on their own and in combination with traditional data sources) is becoming increasingly consolidated. The challenge now is to go beyond the experimental stage of Big Data and enter the stage of maturity, exploiting and accommodating the specific features of Big Data along the entire life cycle of statistical processes. With this goal, Istat set up three pilot projects to exploit three different kinds of Big Data sources, namely: (i) “new” sources enabling access to data not yet collected by Official Statistics; (ii) “additional” sources to be used in conjunction with traditional data sources; and (iii) “alternative” sources fully or partially replacing traditional ones.
In this presentation, we focus on the different scenarios of statistical processes underlying the three projects, in terms of methodological, technological and organizational issues, giving an overview of the main topics addressed. In role (i), the project named “Persons and Places” uses Big Data sources to perform mobility analytics based on mobile phone data (in particular, Call Detail Records), which report the start/end time of a call and its spatial location. In this case, the experimentation focuses on producing the origin/destination matrix of daily mobility for the purposes of work and study at the spatial granularity of municipalities. The analysis was carried out by applying unsupervised clustering methods.
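As a rough illustration of what an origin/destination matrix built from Call Detail Records might look like, the sketch below assigns each user a presumed home municipality (most frequent location at night) and a presumed work/study municipality (most frequent location during the day), then counts the pairs. This modal heuristic is a simplification substituted for the unsupervised clustering used in the project; the record layout, the night-time window and the toy records are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical CDR layout: (user_id, hour_of_day, municipality).
# Toy records for illustration only; not real data.
cdr = [
    ("u1", 2, "A"), ("u1", 23, "A"), ("u1", 10, "B"), ("u1", 15, "B"),
    ("u2", 1, "A"), ("u2", 11, "A"), ("u2", 14, "C"), ("u2", 16, "C"),
    ("u3", 3, "B"), ("u3", 9, "B"), ("u3", 22, "B"),
]


def is_night(hour):
    """Assumed night window: 20:00 to 06:59."""
    return hour >= 20 or hour < 7


def od_matrix(records):
    """Count (origin, destination) municipality pairs, one per user."""
    night = defaultdict(Counter)  # user -> municipality counts at night
    day = defaultdict(Counter)    # user -> municipality counts by day
    for user, hour, place in records:
        bucket = night if is_night(hour) else day
        bucket[user][place] += 1

    matrix = Counter()
    for user in night:
        if user not in day:
            continue  # no daytime activity: destination unknown
        origin = night[user].most_common(1)[0][0]
        destination = day[user].most_common(1)[0][0]
        matrix[(origin, destination)] += 1
    return matrix


matrix = od_matrix(cdr)
for (origin, destination), count in sorted(matrix.items()):
    print(origin, "->", destination, count)
```

A pair whose origin equals its destination, such as ("B", "B") here, corresponds to a user who works or studies in the municipality of residence.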
In role (ii), the “Google Trends” project considers the possible use of Google Trends results to improve predictions, focusing on nowcasting estimates of indicators such as the unemployment rate. The objective was to exploit the query-share time series extracted from Google Trends as auxiliary variables to improve the estimates produced by Istat through model-based estimation methods. The major outcomes are: the test of Google Trends on Italian data in the Labour Force domain, monthly prediction capabilities and possibly estimation of Labour Force indicators at a finer territorial level (i.e. small area estimation).
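The idea of using a query-share series as an auxiliary variable can be sketched with the simplest possible model: an ordinary least-squares line relating the indicator to the query share, then applied to the latest query share before the official figure is available. The abstract does not specify the model-based estimators actually used, so this one-regressor fit is only a stand-in, and the numbers below are synthetic values invented for the example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic monthly values (assumption): query share of job-related
# searches and the unemployment rate (%) published for the same months.
query_share = [40.0, 45.0, 50.0, 55.0, 60.0]
unemployment = [9.0, 9.5, 10.0, 10.5, 11.0]

intercept, slope = fit_line(query_share, unemployment)

# Nowcast for the current month, for which only the query share is known.
current_query_share = 58.0
nowcast = intercept + slope * current_query_share
print(round(nowcast, 2))
```

In practice the auxiliary series would enter a richer time-series or small area model alongside the survey data; the sketch only shows the role the auxiliary variable plays.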
In role (iii), the project “ICT Usage” tests the adoption of Web scraping and text mining techniques to study the usage of ICT by enterprises and public institutions. Starting from the respondent lists of enterprises and institutions used by Istat in a dedicated survey, the project verifies the effectiveness of automated techniques for Web data extraction and processing. The analysis phase applied supervised learning methods to unstructured texts obtained in a preliminary Web scraping phase, which aimed to retrieve information about enterprises' E-Commerce activities directly from their Web sites.
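To make the supervised step concrete, the sketch below trains a small multinomial Naive Bayes classifier that labels scraped page text as indicating E-Commerce activity (1) or not (0). The abstract does not name the learning method used, so Naive Bayes is an assumed example, and the training texts are invented stand-ins for scraped content with labels taken from survey responses.

```python
import math
from collections import Counter

# Invented training texts (assumption): scraped page text with a label,
# 1 = site offers E-Commerce, 0 = it does not.
train = [
    ("add to cart checkout secure payment", 1),
    ("online shop order delivery cart", 1),
    ("company history contact address", 0),
    ("our team news contact careers", 0),
]


def tokenize(text):
    return text.lower().split()


def fit(samples):
    """Collect per-class word counts, class counts and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict(model, "secure checkout and cart"))
print(predict(model, "contact our team"))
```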
Mobile positioning data - Big Data source for tourism, travel, population, migration statistics
Siim Esko, Positium LBS/Estonia
Mobile phones are ubiquitous sensors of society. Billions of anonymous locations are created every day by mobile phones in Brazil. Domestic and foreign mobile phone owners have become the largest sample group in the country providing statistical-grade data. Carefully processed, this information can provide continuous data on various areas: travel, population and migration. Positium has provided official travel statistics based on mobile positioning data since 2009. Beyond existing official statistics, the data will also help detect trends and define new statistical indicators.