Public data
By Bjørn Borud
When the state maintains data that is supposed to be public, leaders from a bygone era don't always understand what this implies. The process of dragging laggards into the present can be painful - and you may end up having to take it to court.
This was written from a Norwegian perspective. Not all parts of this may be relevant to the rest of the world.
Some months ago I wanted to tinker a bit with some data that is supposed to be entirely in the public domain. The data is maintained by a government organization and its maintenance is paid for with tax money. You would think that getting access to the data ought to be straightforward.
“That’s none of your business”
However, I was asked to fill out a form. One of the questions I had to answer was “why do you need access to this data?”. I must admit I was a bit dumbstruck. The data was supposed to be open, available to everyone, and yet they were asking me why I wanted to access it - implying that there are right answers and wrong answers.
Yes, there may be technical usage patterns that can overload serving infrastructure, and those are worth mitigating or avoiding, but denying people access to data that is legally in the public domain is very close to abuse of power, if not the very definition of it.
And of course, abuses of power should not be allowed to pass without consequence.
My impulse was to write “that is actually none of your business, please do not ask me again”. But I didn't, since I was more interested in the data than in starting an argument with some faceless bureaucrat. Some other day, perhaps.
But this irked me. This is exactly the kind of behavior you do not want in people tasked with managing public data. They have to understand their job a lot better than this.
If you are a government agency maintaining some public data set, what people will use it for is none of your business. Your business is to make it available in the most convenient and cost-efficient manner possible.
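If load is the real worry, there are standard technical mitigations that don't involve vetting anyone's motives. Here is a minimal sketch, in Go, of per-client rate limiting in front of a public API; the limits and the use of the remote address as a client key are illustrative assumptions, not a recommendation for any particular data set.

```go
// Minimal sketch: protect serving infrastructure by rate-limiting
// clients instead of asking them why they want the data.
// Limits and client keying are illustrative assumptions.
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{} // one limiter per client
)

// limiterFor returns (creating it if needed) the limiter for a client key.
func limiterFor(key string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[key]
	if !ok {
		l = rate.NewLimiter(10, 20) // 10 requests/second, bursts of 20
		limiters[key] = l
	}
	return l
}

// rateLimited wraps a handler and rejects clients that exceed their limit.
func rateLimited(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.RemoteAddr).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

That is the whole trick: a misbehaving client gets a 429, everyone else gets the data, and nobody has to fill out a form.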
Which brings us to the next topic.
Standards and expensive mistakes
One common mistake is to believe that you always have to employ some technical standard when distributing data. There exist standards for every imaginable type of data. Some are good, and some only get in the way.
If wrangling the data into conformance with some standard is time-consuming and expensive, it is probably also going to be time-consuming and expensive to unwrangle the data. And if this is the case, you have defeated the purpose of making the data available in the first place.
I tend to visualize this situation as two people living in the same valley who decide to climb a 2000m mountain peak to carry out some transaction - and then, transaction completed, have to make their way down the mountain again. Both expend effort uselessly just to meet.
The idea of ignoring standards gives bureaucrats the heebie-jeebies because they often understand nothing of the technical aspects, but feel that if a standard is involved, their behinds are covered. But this isn’t good enough.
Claiming “high cost”
There’s a company whose data I have been trying to access for a few years now, but I have given up every time. No, I am not going to name names. The company in question operates in a sector where, by law, they are obliged to make part of their operational data available to the public. To do this they have chosen a really involved standard to represent the data, which is a royal pain in the neck since the standard is so obsolete that there are no good libraries for interpreting it. But the worst bit is that their API is really slow and often times out. We’re talking up to 15 minutes to respond to a single HTTP request. That’s not slow. That’s completely unacceptable.
On paper they are in compliance. In practical reality they are not. When this was pointed out years ago they complained that “it is too expensive to serve many users”. They even had the gall to complain to the local municipality about it “accessing the data too frequently” when the municipality polled the data roughly every 30 seconds.
I can demonstrably implement an equivalent API (for delivering the data, not necessarily implementing the horrible mess of a standard they have chosen to use) that delivers response times on the order of an IP packet round trip to their server. I know this because it is what I’ve done for a living for a few decades. Also because it isn’t a hard problem to keep a small data structure with a modest number of updates per minute in RAM and serve it rapidly - either to streaming clients or to clients that use request/response. Any half decent programmer can do this.
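To make that concrete, here is a minimal sketch in Go of the architecture I mean: the current dataset lives in RAM as an immutable snapshot that an updater swaps in atomically, and every HTTP request is answered straight from memory. The snapshot shape, update interval, and data source are assumptions for illustration.

```go
// Minimal sketch: serve a small, slowly-changing dataset from RAM.
// Readers never block the updater; each update publishes a fresh
// snapshot with a single atomic pointer swap.
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
	"time"
)

// Snapshot is an illustrative stand-in for whatever the real data is.
type Snapshot struct {
	UpdatedAt time.Time         `json:"updated_at"`
	Records   map[string]string `json:"records"`
}

var current atomic.Pointer[Snapshot]

// fetchFromSource is a placeholder for whatever produces the data.
func fetchFromSource() map[string]string {
	return map[string]string{"example": "value"}
}

// refresh rebuilds and publishes the snapshot; a handful of updates
// per minute is trivial at this rate.
func refresh() {
	for {
		current.Store(&Snapshot{
			UpdatedAt: time.Now(),
			Records:   fetchFromSource(),
		})
		time.Sleep(30 * time.Second)
	}
}

func main() {
	current.Store(&Snapshot{UpdatedAt: time.Now()}) // never serve nil
	go refresh()
	http.HandleFunc("/data", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(current.Load()) // microseconds, not minutes
	})
	http.ListenAndServe(":8080", nil)
}
```

Because a request only dereferences a pointer and encodes what it finds, response times end up bounded by the network rather than the backend - which is why a 15-minute response is indefensible.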
I actually offered to write the software needed to replace the system they have today. At no cost. AND to open source the software so that they could just point people who want to serve a large user base to a GitHub repository and say “we’d appreciate it if you set up your own server if you’re going to serve high volumes” - thus removing any concerns about scalability. I thought it was a pretty fair deal.
I never heard back from them, though. Too bad, because now that I’m doing a startup I can’t afford to do it for free, so they’d have to pay me. However, I suspect it would still be relatively cheap.
But it does make you wonder what their motivations are. Are they just that stupid?
The talk that needs to take place
Making data publicly available is often difficult because people start at the wrong end. They start by building complexity instead of making the data properly available as their first action. There are lots of projects that pretend to organize public data, but every single one of them seems to end up as a catalog of dead or decaying data sources. So clearly, that isn’t the way to do it.
Stop pouring money into those kinds of projects. Instead, look at how one might assist organizations that lack the ability to make data public - by funding computing resources and possibly also developers.
Focus on how to build easy-to-use, resource-efficient, low-maintenance solutions that can leverage communities of developers and the companies that make use of the data. If the data is valuable, it won’t be too hard to find people who will contribute - on a voluntary basis or at an acceptable cost.
But nothing happens if these opportunities are not created.
If someone refines the data and sells the result as a product: that’s good. That’s not an abuse of a common good. It means someone has created value that someone else is willing to pay for. And if you’re not willing to pay someone: you can duplicate their effort and compete.
We also need to talk about how to deal with people in government organizations abusing their position to obstruct access to public data. It shouldn’t be necessary to drag government agencies through court to get access to public data - that should mostly be resolvable outside the legal system - except perhaps in gray areas. There should be mechanisms, and mandates, in place to decide disputes more rapidly. A kind of data ombudsman¹ whose job it is to ensure that public data is actually publicly available, with the mandate to override other bureaucrats in these questions.
There needs to be a lot more pragmatism in making public data public.
¹ Is there a gender-neutral equivalent to this word?