In a previous post, The costs of storage in the cloud, I attempted to assess the differing costs of storing various amounts of data in the cloud using Amazon S3, Dropbox and Rackspace. I ended by asking:
What is interesting is whether [those prices] can be matched or beaten by academic providers (such as Eduserv) and/or in-house institutional provision, and if so by how much?
So... let's do a quick exercise!
Supposing you wanted to build your own Amazon S3 service? How much would it cost? Could you beat Amazon's S3 prices? And if not, what economies of scale would be necessary before you could?
To answer those questions, let's forget about actual money for a moment. Rather, let's draw up a list of the things that would have to be paid for. We can fill in the numbers later.
So firstly, there's the storage hardware - the raw disks. Of course, it's not quite as simple as that. Decisions need to be made about the kind of storage you want... a fibre-channel based SAN vs. cheaper SATA disks organised into some kind of Network-attached Storage (NAS) for example. Reliability vs. cost will be the issue here. Given that we're trying to build an alternative to Amazon S3, let's use SATA disks. Then there's the chosen architecture. How resilient do you want your storage to be? Amazon offer 99.999999999% durability (which, I think, means they will only lose data 1 time in 100,000,000,000?) with no single points of failure (and a design to sustain the concurrent loss of data in two facilities)... both of which are pretty hard to compete with, so let's relax that requirement a little. Let's say that we want all data to be replicated across 2 separate NAS clusters (meaning that both clusters have to fail before data is lost), running RAID 6 within each cluster (providing fault tolerance from two simultaneous drive failures). With that kind of configuration, providing 1PB of usable disk would require something like 2.7PB of raw disk. (Actually, according to Backblaze's How to build cheap cloud storage, a RAID 6 design based on 15 disks, of which 2 are parity disks, should deliver "87 percent of the raw hard drive totals" as usable space, so we should be able to do a bit better than that).
Then there's the network switching and cabling necessary to join everything together - with, again, decisions to be made about bandwidth and so on. There's also the connection to the outside world to worry about - routers, switching and a firewall for example.
Finally, there's the cost of the physical data centre space to consider, at least insofar as it represents an opportunity cost against doing something else.
That covers the initial investment (which is already non-trivial for any kind of substantial cloud infrastructure), for which costs can probably be depreciated over, say, 5 years.
There are also the recurrent costs...
Energy, both to power the disk arrays and other servers and to keep everything cool, and staffing. Operator cover (perhaps 0.5FTE per 10PB?), some developer effort at least initially (both to keep things running smoothly and to assist in integrating the cloud storage with other systems - let's say 1FTE), a service/project manager (again, let's say 1FTE) and some procurement/financial effort. Again, these are non-trivial sums of money.
So that's my shopping list. What have I forgotten? What have I over-specified?
At this stage, I'm not going to fill in the actual amounts of money - not least because the costs of storage are dropping all the time and the actual price paid will be subject to negotiation. But the shopping list:
- Disks
- Network infrastructure (switching, etc.)
- Router/firewall
- Physical space costs
- Energy
- Operator cover
- Development effort
- Project/service management
- Procurement/financial effort