How Blockchains Become Great Big Garbage Patches for Data
CoinDesk columnist Nic Carter is a partner at Castle Island Ventures, a public blockchain-focused venture fund based in Cambridge, Mass. He is also the cofounder of Coin Metrics, a blockchain analytics startup.
In the Disney Pixar movie “WALL·E,” the eponymous robot hero trundles around an abandoned Earth, methodically compacting mounds of old garbage. The planet had become barren and sterile, covered in the residual detritus of rampant consumerism.
If we’re not careful, most public blockchains will share the fate of WALL·E’s Earth, destined to become deserted repositories of ancient garbage: not physical garbage but junk data that is irrelevant, anachronistic and disused.
There’s a lot at stake. Blockchains that welcome the “highly available generic database” use case will suffer one of two dismal fates: either nodes become practically impossible to run long term, or node operators will discard data, weakening immutability promises.
While Bitcoin’s approach of constricted block space (and consequently higher fees) disincentivizes the insertion of arbitrary, non-transactional data on chain, its competitors insist on low fees, effectively subsidizing marginal usage. This has had visible effects already – and introduces long-term risks that will have to be reckoned with.
To understand why using blockchains for storing arbitrary data is a bad idea, let us consider them in the abstract. A blockchain manages the continuous auction of block space to the public in exchange for fees (and a subsidy). Miners can claim these fees in exchange for constructing and ordering blocks. Transactors tolerate these fees because the blockchain generates strong settlement assurances that can’t be found elsewhere.
The quality of these assurances is largely a function of security spend, which is itself constituted from fees and the subsidy. Fees arise from the interplay between a bounded quantity of block space and demand to use that block space. Lastly, remember that node operators are the ones bearing the costs of data being added to the chain. Any data added today is effectively an externality that node operators have to tolerate in perpetuity.
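The interplay between bounded block space and fees can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular miner software works: the `Tx` type, the `build_block` function, the transaction names and all the numbers are hypothetical, and real block construction is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    txid: str
    size_bytes: int
    fee: int  # in the chain's smallest currency unit (illustrative)


def build_block(mempool, max_block_bytes):
    """Greedily fill a bounded block, highest fee per byte first."""
    chosen, used = [], 0
    by_fee_rate = sorted(mempool, key=lambda t: t.fee / t.size_bytes, reverse=True)
    for tx in by_fee_rate:
        # A transaction is included only if it still fits under the cap.
        if used + tx.size_bytes <= max_block_bytes:
            chosen.append(tx)
            used += tx.size_bytes
    return chosen


# Hypothetical pending transactions competing for space.
mempool = [
    Tx("payment", 250, 5000),         # 20 units per byte
    Tx("airdrop-spam", 400, 400),     # 1 unit per byte
    Tx("exchange-sweep", 600, 9000),  # 15 units per byte
]

bounded = build_block(mempool, max_block_bytes=1000)
unbounded = build_block(mempool, max_block_bytes=10**9)

print([tx.txid for tx in bounded])    # low fee-rate data is priced out
print([tx.txid for tx in unbounded])  # with no effective cap, everything gets in
```

With a tight cap, the low-paying transaction is left out; with an effectively unlimited cap, it is stored alongside everything else at almost no cost to its sender.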
So is a payload of data – a transaction – an asset or a liability? It depends. I’d venture that a transaction is an asset to the blockchain if two conditions hold: it contributes to security spend, and it involves the chain’s currency.
That transactions should contribute to security spend is obvious. That they should involve currency is not. In effect, there’s a maturity mismatch between the way people use blockchains and their long-term maintenance costs. Public blockchains are intended to store data in perpetuity; they achieve this impressive feat by replicating the database across many nodes. However, as mentioned, they rely on the willingness of node operators to ingest, store and serve up this data forever. If transactions impose a significant cost relative to their contribution to the security of the blockchain, they are a net negative.
So I’d venture that data inscribed on-chain is an asset to the extent that it’s economically relevant and will contribute value to the system by inducing users to transact. It’s a liability to the extent that node operators must ingest the data, validate it and store it. If the data is a UTXO, it’s highly likely to be relevant in the future: Transactors eventually spend their coins. If it’s spam relating to an airdrop for a transient token, it may never be relevant again. And what node operator wants to foot the bill for terabytes of irrelevant, uneconomical data?
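One rough way to picture the asset-versus-liability question is to compare the fee a transaction pays once with the storage cost it imposes on every node for as long as the chain is kept. The sketch below is a toy model under loud assumptions: the function names, the node count, the per-byte cost and the time horizon are all made up for illustration, and it ignores bandwidth, validation cost and any future economic relevance of the data.

```python
def lifetime_storage_cost(size_bytes, node_count, cost_per_byte_year, years):
    """Total cost borne by all nodes for keeping the data over the horizon."""
    return size_bytes * node_count * cost_per_byte_year * years


def is_net_asset(fee_paid, size_bytes, node_count, cost_per_byte_year, years):
    """True if the one-time fee covers the replicated, long-run storage cost."""
    cost = lifetime_storage_cost(size_bytes, node_count, cost_per_byte_year, years)
    return fee_paid >= cost


# Illustrative assumptions only: 10,000 nodes, 1 cost unit per byte per year,
# and a 50-year horizon standing in for "in perpetuity".
NODES, COST, YEARS = 10_000, 1, 50

# A 250-byte payment that paid a meaningful fee.
print(is_net_asset(200_000_000, 250, NODES, COST, YEARS))
# A 250-byte spam transaction on a zero-fee chain.
print(is_net_asset(0, 250, NODES, COST, YEARS))
```

Because the cost side scales with the number of nodes and the number of years, even a small payload becomes expensive when nothing was paid for it.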
To be clear, the Bitcoin-like blockchain model isn’t perfect. Bitcoin depends on the willingness of node operators to download and propagate blocks without compensation, a bit of an oddity in a system that is otherwise strongly driven by free market incentives. To account for this, Bitcoin developers have been careful to limit the amount of block space available such that node operation is still possible on commodity hardware. Depending on how you count it, the entire blockchain is still only about 274 GB, even after 11 years of operation. Levying an ongoing tax on storage, as the state rent proposal aims to do for Ethereum, is another potential solution to the problem. Other blockchains, in their eagerness to differentiate from Bitcoin and its purportedly high fees, created a zero- or low-fee environment.
But, of course, fees serve as a sort of financial proof-of-work. They require transactors to insert into the chain only information they consider worth paying for. This makes it more expensive to generate spam and discourages wasteful usage modes. Since demand for perpetual, highly available storage is almost infinite (wouldn’t you create a highly available, perpetual cloud backup of your 10 TB torrent collection if storage were essentially free?), it’s likely that low- or no-fee chains will be filled with junk data, given enough time.
Predictably, this is what has happened. Reduce the clearing price for inclusion in a replicated, highly available database to zero and expect opportunistic spammers to take advantage. Examples abound. A huge fraction of transactions on Stellar relate to a service called Diruna that apparently incentivized users to spam the blockchain. Diruna appears to be defunct now. Its on-chain footprint lives on, though, effectively indelible. Bitcoin Cash and Litecoin bear the imprint of an application called “Bitcoin Aliens,” a tool that pays users minuscule amounts for viewing ads.
Something called “Blitz Ticker” accounts for up to 50% of BCH transactions on any given day. Its purpose? Inserting market data onto the blockchain. Ethereans may remember a period in mid-2018 when the biggest consumer of gas was a mysterious exchange called FCoin, which ran a competitive token listing scheme that incentivized individuals to spam the blockchain. A pattern emerges: private gains, public externalities. FCoin is insolvent now, but its impact will be felt on Ethereum forever because token transactions cannot easily be disentangled and pruned out.
Bitcoin’s approach to the issue was to designate an opcode to act as a kind of sink for non-transactional data. Previously, people were encoding data directly in addresses, producing outputs that were mostly indistinguishable from normal transactions. Thus OP_RETURN was chosen specifically to handle arbitrary data, so it could be identified and pruned out by nodes with little difficulty.
As it turns out, Bitcoin’s protocol is designed to cultivate its own UTXO set. OP_RETURN saw significant usage from Omni (which powered Tether transactions) and Veriblock, but little else. The impact on the blockchain is fairly low; Strehle and Steinmetz find that OP_RETURN data in Bitcoin accounts for around 3% of the overall blockchain data overhead. Should it grow, however, nodes would have the option to discard OP_RETURN outputs altogether, as they are provably unspendable and not relevant from a transactional perspective.
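Why OP_RETURN data is easy to identify can be shown with a minimal sketch. In Bitcoin script the OP_RETURN opcode is the byte 0x6a, so an output script that begins with it is provably unspendable. The helper names and sample scripts below are hypothetical, and the sketch only counts script bytes, not full serialized outputs.

```python
OP_RETURN = 0x6A  # Bitcoin script opcode that marks an output as unspendable


def is_op_return(script_pubkey: bytes) -> bool:
    """True if the output script starts with OP_RETURN."""
    return len(script_pubkey) > 0 and script_pubkey[0] == OP_RETURN


def prunable_bytes(output_scripts) -> int:
    """Sum the script bytes a node could discard without losing spendable coins."""
    return sum(len(s) for s in output_scripts if is_op_return(s))


# Sample scripts for illustration:
# a standard pay-to-pubkey-hash output with a dummy all-zero 20-byte hash...
p2pkh = bytes.fromhex("76a914" + "00" * 20 + "88ac")
# ...and an OP_RETURN output carrying an 11-byte payload.
data_carrier = bytes([OP_RETURN, 0x0B]) + b"hello world"

print(is_op_return(p2pkh))
print(is_op_return(data_carrier))
print(prunable_bytes([p2pkh, data_carrier]))
```

Data hidden in fake addresses offers no such marker, which is why it cannot be separated from genuine outputs this cheaply.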
Ultimately, node operators on blockchains burdened with lots of non-transactional data will have to consider periodically pruning their datasets. This is convenient, but it trades off against the data immutability and availability that users seek in blockchains. If validators or archivists can effectively exercise eminent domain by arbitrarily deleting data that users consider important, the chain’s assurances are effectively nonexistent. So the discarding approach stands in direct opposition to a desirable quality of public blockchains: making data available to users in perpetuity.
The issue is that if even one single entity has an interest in the existence of some otherwise-nuisance data, validators cannot eliminate that data without effectively depriving this individual of their property. But there’s an enormous asymmetry here: One economically minded individual can essentially compel all present and future users of the blockchain to ingest their transaction. The alternative is the unpalatable choice to disempower commodity nodes and opt for a model where only the largest nodes survive.
This tension is unresolvable unless the available data slots are strictly bounded and fees are employed to meter blockchain usage. Open the gauge and deal with either data loss and user frustration, or unbounded state growth and impossible validation.
Far from making blockchains more convenient, unlimited block size and zero fees render them less reliable and virtually guarantee either the long-term loss of supposedly immutable data or the compromise of decentralization at the node level.
Thanks to Antoine Le Calvez, David Vorick, Lucas Nuzzi and Takens Theorem for their feedback.