ST 11 – Big Data – Academia and Statistics

The Brave New World of Data

Francisco Louzada, ICMC-USP

The constant generation of data has driven a fundamental change across sectors of the economy and in the consumption of goods and services itself. We are in a new era, surrounded by data on all sides: enormous masses of data arising from automatic collection processes, electronic instrumentation, online transactions and historical data collected over many years. The moment belongs to the big, ahead of mining, fishing, snooping, dredging... A new consumer market and a new retail sector are emerging under new technologies and behaviours. In this new world order of dataset scale, statistical science has been undergoing adaptive processes so that it can continue to provide useful knowledge effectively. Traditional analysis strategies have been revisited in light of the sheer volume of information and the contamination present in it, but also of missing data, variables that are not identically distributed, non-stationarity, bias and non-numeric variables. There is a pressing need to develop alternative strategies based on sequential and adaptive estimation methods, segmentation structures and multiple model combinations, tied directly to efficient computational strategies that deliver answers in real time. This presentation discusses the opportunities that this brave new world offers to those who enter it, given the contrast between the ultra-structured modernity of data capture mechanisms and the statistical dissatisfactions inherent in our theoretical perceptions, and how we can contribute to training the new kind of professional needed to work effectively within it.
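As one concrete illustration of a sequential estimation method, the sketch below updates a mean and a sample variance one observation at a time, so the full data set never has to be held in memory. It uses Welford's algorithm, which is a standard textbook technique chosen here for illustration and not a method named in the abstract; the class name `RunningStats` and the input values are hypothetical.

```python
class RunningStats:
    """Online mean and sample variance via Welford's algorithm."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN until two points arrive."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


# Illustrative values only; in practice these would arrive as a stream.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Each update costs constant time and memory, which is what makes this family of estimators suitable for data that arrive continuously.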

Beyond the infancy of Big Data Analysis: Learning from Istat experiences to grow up

Stefano De Francisci, Monica Scannapieco and Giulio Barcaroli, Istat, Italy

Big Data are becoming increasingly important as additional data sources for Official Statistics. After a period of experimental analyses, focused on investigating the potential that the variety, velocity and volume of Big Data hold for several statistical domains, the use of specific methods, technologies and processes that bring new Big Data sources into statistical surveys (both on their own and in combination with traditional data sources) is becoming increasingly consolidated. The challenge now is to go beyond the experimental stage of Big Data and enter the stage of maturity, exploiting and accommodating the specific features of Big Data along the entire life cycle of statistical processes. With this goal, Istat set up three pilot projects to exploit three different kinds of Big Data sources, namely: (i) “new” sources enabling access to data not yet collected by Official Statistics; (ii) “additional” sources to be used in conjunction with traditional data sources; and (iii) “alternative” sources fully or partially replacing traditional ones.

In this presentation, we focus on the different scenarios of statistical processes underlying the three projects, in terms of methodological, technological and organizational issues, giving an overview of the main topics addressed. In role (i), the project named “Persons and Places” makes use of Big Data sources by performing mobility analytics on mobile phone data (in particular, Call Detail Records), which report the start/end time of a call and its spatial location. In this case, the experimentation focuses on producing the origin/destination matrix of daily mobility for work and study purposes at the spatial granularity of municipalities, and the analysis was carried out by applying unsupervised clustering methods.
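A minimal sketch of how an origin/destination matrix could be derived from call records is given below. It is not the clustering-based procedure used in the project: as a simpler stand-in, each user's origin is taken to be the municipality seen most often at night and the destination the one seen most often during working hours. The function name `od_matrix`, the hour windows, the record layout and the toy records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def od_matrix(records, night=(0, 6), day=(9, 17)):
    """Count users per (origin, destination) municipality pair.

    records: iterable of (user_id, iso_timestamp, municipality), where the
    timestamp is the call start time. Hour windows are half-open [start, end).
    """
    night_seen = defaultdict(Counter)  # user -> municipality counts at night
    day_seen = defaultdict(Counter)    # user -> municipality counts by day
    for user, timestamp, place in records:
        hour = datetime.fromisoformat(timestamp).hour
        if night[0] <= hour < night[1]:
            night_seen[user][place] += 1
        elif day[0] <= hour < day[1]:
            day_seen[user][place] += 1

    matrix = Counter()
    # Only users observed in both windows contribute a flow.
    for user in night_seen.keys() & day_seen.keys():
        origin = night_seen[user].most_common(1)[0][0]
        destination = day_seen[user].most_common(1)[0][0]
        matrix[(origin, destination)] += 1
    return matrix


# Made-up records for three hypothetical users and two municipalities.
records = [
    ("u1", "2015-03-02T02:10:00", "A"),
    ("u1", "2015-03-02T10:30:00", "B"),
    ("u1", "2015-03-02T15:00:00", "B"),
    ("u2", "2015-03-02T01:00:00", "A"),
    ("u2", "2015-03-02T11:00:00", "A"),
    ("u3", "2015-03-02T12:00:00", "B"),  # no night record, so excluded
]
matrix = od_matrix(records)
```

The modal-location rule is deliberately crude; a production pipeline would also need to handle sparse users, multi-day data and the weighting from phone users to the resident population.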

In role (ii), the “Google Trends” project considers the possible use of Google Trends results to improve predictions, focusing on nowcasting estimates of indicators such as the unemployment rate. The objective was to exploit the query-share time series extracted from Google Trends as auxiliary variables to improve the estimates produced by Istat through model-based estimation methods. The major outcomes are: the test of Google Trends on Italian data in the Labour Force domain, monthly prediction capabilities and possibly finer territorial-level estimation (i.e. small area estimation) of Labour Force indicators.
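The idea of using a query-share series as an auxiliary variable can be sketched with the simplest possible model, an ordinary least-squares line relating the indicator to the query share. This is only a stand-in: the abstract does not specify which model-based estimators were used, and the function names and all numbers below are synthetic and chosen purely for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def nowcast(intercept, slope, latest_query_share):
    """Predict the indicator for a period where only the query share is known."""
    return intercept + slope * latest_query_share


# Synthetic history: query share (auxiliary) and unemployment rate in percent.
query_share = [40.0, 45.0, 50.0, 55.0, 60.0]
unemployment = [10.0, 10.5, 11.0, 11.5, 12.0]

intercept, slope = fit_line(query_share, unemployment)
estimate = nowcast(intercept, slope, 65.0)  # query share already observed
```

The point of the auxiliary series is timing: the query share is available almost immediately, while the survey-based indicator is published with a delay.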

In role (iii), the project “ICT Usage” tests the adoption of Web scraping and text mining techniques to study the usage of ICT by enterprises and public institutions. Starting from the respondent lists of enterprises and institutions used by Istat within a dedicated survey, the project verifies the effectiveness of automated techniques for Web data extraction and processing. The analysis phase used supervised learning methods on unstructured texts obtained from a preliminary Web scraping phase, which was aimed at retrieving information about the E-Commerce activities of enterprises directly from their Web sites.
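The supervised step can be illustrated with a small multinomial naive Bayes text classifier that labels scraped page text as describing E-Commerce activity or not. The abstract does not say which supervised learners were used, so naive Bayes is an assumption made here for illustration, as are the class name, the labels and the tiny training texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up snippets standing in for text scraped from enterprise Web sites.
texts = [
    "add to cart and checkout with secure payment",
    "online shop free shipping on every order",
    "company history and contact information",
    "our team offices and press releases",
]
labels = ["ecommerce", "ecommerce", "other", "other"]

model = NaiveBayes().fit(texts, labels)
shop_label = model.predict("checkout and pay for your order in our online shop")
info_label = model.predict("contact our press team")
```

In a real setting the training labels would come from survey responses for the same enterprises, which is what allows the scraped predictions to be checked against the traditional source.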

Mobile positioning data - Big Data source for tourism, travel, population, migration statistics

Siim Esko, Positium LBS/Estonia

Mobile phones are ubiquitous sensors of society. Billions of anonymous locations are created every day by mobile phones in Brazil. Domestic and foreign mobile phone owners have become the largest sample group in the country providing statistical-grade data. Carefully processed, this information can provide continuous data on various areas: travel, population, migration. Positium has provided official travel statistics based on mobile positioning data since 2009. Beyond existing official statistics, the data will help detect trends and define new statistical indicators.