In the United States the way cities gather and publish data varies widely. There are still many issues that local authorities need to address in order to enable all kinds of users to make good use of the facts and figures being collected.
Urban data is of great value for research purposes and “will be essential to inform both policy and administration and enable cities to deliver services effectively, efficiently, and sustainably while keeping their citizens safe, healthy, prosperous, and well-informed”, say the authors of a recently published research report entitled Structured Open Urban Data: Understanding the Landscape. The researchers – Luciano Barbosa from IBM Research, Rio de Janeiro, Brazil, Kien Pham from the Department of Computer Science and Engineering, NYU School of Engineering, New York, and other colleagues – set out to gain a better understanding of the current situation and to assess the challenges and opportunities involved in finding, using, and integrating urban data. They collected over 9,000 data sets from 20 cities in North America. Since the first urban data set was published in 2009 in Seattle, Washington State, the number of available data sets has continued to grow.
The researchers found a strong correlation between population size and the number of data sets available: the larger a city’s population, the more complex its organisation, and the more data it publishes. The authors are encouraged by the progress so far but underline the need for greater consistency in the way the data is collected and stored. For one thing, the data is not always kept up to date: as much as 70% of it has not been changed at all since it was first posted online.
Data sets need to be both accessible and transparent
Whether or not data can be used easily depends largely on how it has been formatted. In fact, most of the 70GB of city data collected is available in tabular formats – such as the comma-separated values (CSV) file format, i.e. numbers and text in plain-text form – so that it can be integrated and used, with the potential for linking, aggregating and cross-checking. However, cities still vary in the way they present their data, which can hamper its meaningful use by residents, researchers and policymakers.
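The linking and cross-checking that tabular formats make possible can be sketched in a few lines. The example below joins two made-up CSV exports on a shared key; the file contents, column names and values are hypothetical illustrations, not drawn from any actual city portal:

```python
# Minimal sketch of linking two tabular open-data exports on a shared key.
# All file contents and column names here are hypothetical examples.
import csv
import io

inspections_csv = """district_id,open_inspections
D1,14
D2,3
"""
districts_csv = """district_id,district_name
D1,Downtown
D2,Riverside
"""

# Build a lookup table from one export...
districts = {row["district_id"]: row["district_name"]
             for row in csv.DictReader(io.StringIO(districts_csv))}

# ...then cross-reference it against the other, attaching a
# human-readable name to each inspection count.
linked = [
    {"district": districts[row["district_id"]],
     "open_inspections": int(row["open_inspections"])}
    for row in csv.DictReader(io.StringIO(inspections_csv))
]
print(linked)
```

This kind of join only works smoothly when different agencies agree on key columns and formats – which is precisely the consistency the report finds lacking.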
The benefits of integrating city data have already led to a number of success stories. For instance, by combining data from multiple agencies and then applying predictive analytics, the New York City authorities managed to increase the detection rate of dangerous buildings and improved the return on the time of building inspectors looking for illegal apartments.
In the era of Big Data, redundancy is a key issue when data is gathered from a number of sources. The report highlights the fact that several areas may submit overlapping or identical data. Meanwhile, open data can also suffer from serious lacunae: in New York City data tables, as many as 30% of the columns may be empty, note the authors. In addition, some local authorities have not implemented their systems properly. Last but not least, the titles given to some columns and data tables are manifestly inadequate for identifying the information they contain, which leads to duplicate collection and data pile-up, in the worst case rendering the information indecipherable.
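Quality problems such as entirely empty columns are straightforward to flag programmatically. The sketch below scans a hypothetical CSV export (the column names and rows are invented for illustration) for columns that contain no values at all:

```python
# Sketch: flag columns that are entirely empty in a CSV export - one of
# the data-quality problems the report describes. Data is hypothetical.
import csv
import io

raw = """permit_id,address,inspector_notes
1001,12 Main St,
1002,48 Elm Ave,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# A column is "empty" if every row's cell is blank or whitespace.
empty_columns = [
    col for col in rows[0]
    if all(not row[col].strip() for row in rows)
]
print(empty_columns)
```

Run over the tables the authors examined, a check like this would surface the roughly 30% of empty columns they report in New York City’s data.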
Helping users by not fine-tuning
Another key issue highlighted in the IBM Research/NYU School of Engineering report is the whole question of centralising the datasets emanating from the various cities. The authors underline that the basic requirement must be to make useful data available. The data may have been entered correctly and be machine-readable, but it must also be easy for anyone who wishes to use it to find and access. Given that the data tables they studied had been downloaded fewer than a hundred times on average, they question whether this availability criterion is actually being met.
If cities simply go it alone, publishing data on their own individual websites without any wider coordination, this will not help to connect up one city’s data with another, point out Luciano Barbosa and Kien Pham. Country-wide homogenisation of data tables would help both policymakers and ordinary citizens who wish to consult the data. On the other hand, they warn against a priori data aggregation – that is, pre-selecting data without taking into account the real needs of potential users.