Can I trust this code I found on the Internet?
By Bjørn Borud
During a meeting today someone raised an interesting question about how we vet third party dependencies in software. This is not only a good question, but it pokes at something all programmers are guilty of at some point: including libraries without properly vetting them.
The only answer I could really come up with was “it’s a judgement call”, because it is really hard to describe what I do to vet third party code. There isn’t one method I follow strictly. It depends, which is the answer nobody wants to hear. Ever.
Some years ago I worked in a company where at least part of the company adhered to a relatively strict quality regimen: all code has good enough tests, all code is properly documented, and all code is reviewed before being merged into the main branch.
At least those were the standards we tried to achieve. And they weren’t unreasonable standards.
Yet at times there was a tendency to trust third party code more than we trusted our own. Which didn’t make sense. As one person on the team sarcastically put it:
“Oh sure, we’ll unblinkingly trust any crazy rubbish random people on the internet cobbled together – but stuff we’ve written ourselves, under a known quality regimen, is suspect!”
Which I suppose most developers have felt at some point.
Vetting code
The only real way to vet code is to read it. To be fair, that is not always feasible. It takes time to read and understand someone else’s code, and some things you might want to use may be hundreds of thousands of lines long. At least you should have a look at the code and see if it adheres to the quality standards you aspire to live by. And there are usually ways to tell short of ingesting the whole codebase. Read the parts of the code that seem to matter the most. Does it look reasonable? Look at the structure. Are there clean, well-defined interfaces between the different parts of the code? Are they documented? Do they need to be? Does it build? How? Does it have tests? Do they run cleanly? Do they run quickly? Will it work on all platforms you need it to work on? What dependencies does it have? Are they sane (or do you need to descend recursively into those as well)? And so on.
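To make the "what dependencies does it have?" question a bit more concrete, here is a rough sketch in Go that pulls the require entries out of the text of a go.mod file and marks which ones are indirect. It assumes a gofmt-style go.mod, and the names Dep and listDeps as well as the module paths in the sample are made up for illustration. For real work the Go toolchain already answers this (for instance with go list -m all), so treat this only as a way to see how little or how much a candidate library drags in.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Dep is one requirement found in a go.mod file (hypothetical helper type).
type Dep struct {
	Path     string
	Version  string
	Indirect bool
}

// listDeps extracts the require entries from the text of a go.mod file.
// Assumption: the file is formatted the way the go tool writes it, with
// either "require path version" lines or a "require ( ... )" block.
func listDeps(gomod string) []Dep {
	var deps []Dep
	inBlock := false
	sc := bufio.NewScanner(strings.NewReader(gomod))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == "require (":
			inBlock = true
			continue
		case inBlock && line == ")":
			inBlock = false
			continue
		case strings.HasPrefix(line, "require "):
			// Single-line form: drop the keyword and parse the rest.
			line = strings.TrimPrefix(line, "require ")
		case !inBlock:
			// Anything else outside a require block is not a dependency.
			continue
		}
		// Split off a trailing comment such as "// indirect".
		spec, comment, _ := strings.Cut(line, "//")
		fields := strings.Fields(spec)
		if len(fields) < 2 {
			continue // blank line or comment-only line
		}
		deps = append(deps, Dep{
			Path:     fields[0],
			Version:  fields[1],
			Indirect: strings.Contains(comment, "indirect"),
		})
	}
	return deps
}

func main() {
	// Made-up go.mod contents, for illustration only.
	sample := `module example.com/app

go 1.21

require (
	github.com/a/b v1.2.3
	github.com/c/d v0.1.0 // indirect
)

require github.com/e/f v1.0.0
`
	for _, d := range listDeps(sample) {
		fmt.Printf("%s %s indirect=%v\n", d.Path, d.Version, d.Indirect)
	}
}
```

A long list of indirect entries is usually the cue to start descending recursively, or to reconsider.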
Copy and paste?
There is always the possibility of copying just the parts of the code that you need (license permitting). If, indeed, you can separate them out cleanly. For some reason people see this as suspect. You are supposed to either include the whole thing, or none of it. Copying and pasting is somehow dirty.
That’s a bit rigid, isn’t it? A lot of the things you often consider settled issues are things you have probably given some thought in the past. Or perhaps someone has hammered into your head that “this is the way we do it”. Rules of thumb are nice shortcuts. But occasionally you might want to think about whether or not something is useful in the current context.
A lot of programmers write code as if for an audience. And they want to avoid being ridiculed by that audience. Which can lead to obsession with orthodoxies.
If you only need a hashing function for some non-cryptographic use, you may not need to include the entire half-a-million-lines-of-code cryptographic library it came from. You might be able to copy just a single function (after having checked the license) and drop a comment in the source indicating where you took this function from. That’s an entirely legitimate thing to do.
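As an illustration of what such a copy might look like, here is a minimal sketch in Go: one small non-cryptographic hash function (32-bit FNV-1a) living in your own source tree, with a comment recording where it came from. The provenance fields are placeholders, not a real source, and in actual Go code you would more likely reach for hash/fnv in the standard library. The point is only the pattern: one function, one comment, no extra dependency.

```go
package main

import "fmt"

// fnv1a32 returns the 32-bit FNV-1a hash of data.
// Non-cryptographic: fine for hash tables or sharding, not for security.
//
// Copied from: <project URL>   (placeholder, fill in the real source)
// File/commit: <path>@<commit> (placeholder)
// License:     <license name>  (placeholder, check before copying)
func fnv1a32(data []byte) uint32 {
	const (
		offsetBasis uint32 = 2166136261 // standard FNV 32-bit offset basis
		prime       uint32 = 16777619   // standard FNV 32-bit prime
	)
	h := offsetBasis
	for _, b := range data {
		h ^= uint32(b) // FNV-1a: xor the byte in first...
		h *= prime     // ...then multiply (wraps around modulo 2^32)
	}
	return h
}

func main() {
	fmt.Printf("%08x\n", fnv1a32([]byte("a"))) // prints e40c292c
}
```

If upstream later fixes a bug, the comment tells you where to look.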
Roll your own?
Perhaps you should write the code yourself? Most of the time the answer will be “no”, but not always.
If you rewrite things that already exist (in some form), you risk being accused of cultivating a Not Invented Here mindset. But if you actually think this is the best way to go about it, you shouldn’t be afraid to go down this route, regardless of what people who haven’t spent time in your shoes say.
Reputation
Codebases that are well known tend to have some form of reputation. And if you are new to a codebase, there are factors that can tell you something about what other people think of it. For instance, if a project on GitHub ticks a bunch of the right boxes (has lots of stars/followers, seems to have a reasonable commit log, has consistent and recent activity, issues seem to be handled professionally, etc.), that counts in its favor.
But these factors are only an indication – not a guarantee. There’s lots of code that is terrible, yet is maintained enthusiastically and with seeming professionalism.
If, on the other hand, you find a piece of code nobody cares about, you might want to put it under more scrutiny. For instance by reading through the source, and perhaps making copies of it.
During the same meeting referred to above I jokingly pointed out that the project whose code we were going through has one such dependency: github.com/borud/broker, which at the time of writing has two stars (wow, someone besides me actually uses this?). On the other hand, what it does is so trivial that you can read through it and probably understand all of it in 5 minutes. And the license is very permissive if you wish to copy the code.
Who gives something its reputation?
If you place stock in the reputation of a piece of code, you sometimes have to understand who creates that reputation. Who are the proponents of the code? They may not have the same goals, or care as much about quality or speed or consistency or correctness, as you do.
We recently discovered that a library we use in some applications was severely lacking in one key respect. It was slow. “Slow” isn’t really a metric, so let’s elaborate a bit.
You can usually trade speed for convenience or quality, but you have to be aware of how far you’ll stretch. If something is really good I might be willing to give up perhaps 10-20% performance. This library was three orders of magnitude slower than it ought to be. It is so slow that if I express it in percentages there will be enough zeroes in the number to force you to translate it into “how many times slower” to usefully understand it.
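To put numbers on that, here is a small sketch in Go of the conversion between "how many times slower" and "percent slower". It assumes "three orders of magnitude" means a factor of 1000 in running time, and the helper name percentSlower is made up for this example.

```go
package main

import "fmt"

// percentSlower converts a slowdown factor ("N times as long") into
// "percent slower". A factor of 1 means no slowdown, so 0 percent.
func percentSlower(factor float64) float64 {
	return (factor - 1) * 100
}

func main() {
	// Taking 1.2 times as long: the sort of cost you might accept.
	fmt.Printf("%.0f%%\n", percentSlower(1.2)) // prints 20%

	// Three orders of magnitude, assuming that means a factor of 1000.
	fmt.Printf("%.0f%%\n", percentSlower(1000)) // prints 99900%
}
```

At that scale the percentage stops being readable, which is why "1000 times slower" is the more useful way to say it.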
And on closer inspection, it was slow because it wasn’t particularly well designed or written. Yes, it offers a lot of functionality that people might find useful, but at that price it wasn’t really as interesting anymore.
It had a good reputation among people who don’t actually know any better.
So where are we?
If you want to hedge your bets, the best answer to the question in the title is “probably not”. In the general case. But then again, would you ask other people to trust your own code without vetting it themselves? Probably not. If you’re realistic you probably don’t trust your own code all that much either.