Building scalable data systems

In my tenure at HubSpot I’ve been on teams that have built and rebuilt various data systems, and every time we’ve tried to construct “scalable” solutions. We’ve hit the mark on some projects and been wildly off on others. I’m in the middle of another such project now, and a few of the things I’ve learned seemed important enough to share, or at least to write down to remind myself later.

Make an API. Whether you’re creating a new RESTful HTTP service or using ProtoBuf/Thrift, it’s worth putting a layer between your raw data source and your consumers. That layer insulates client code from all sorts of data-management concerns: you can add cache layers, shard databases, even switch the entire storage system behind the API while clients stay blissfully unmodified. All of this puts you on the hook for making the API fast and reliable, but the ability to swap moving parts behind the scenes is invaluable. An API also makes it easier for other developers, internal and external, to make use of your data. Internal consumers can build tighter integrations between systems within your product(s), and external developers can use your system’s data to build plugins and add-ons that add value for your customers in ways you hadn’t thought of yet. Of course, with all these other developers and teams accessing your data now…
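As a rough illustration of that insulation, here’s a minimal sketch in Python. All the names (`LeadApi`, `DictBackend`, and so on) are hypothetical, not HubSpot’s actual code; the point is only that clients talk to one surface, so caches and backends can change behind it.

```python
# Sketch of the "API layer" idea: clients call LeadApi and never touch the
# storage engine directly, so the engine (and any caching) can be swapped.

class DictBackend:
    """Stand-in storage engine; a DB-backed class with the same two
    methods could replace it without any client code changing."""
    def __init__(self):
        self._rows = {}

    def get(self, lead_id):
        return self._rows.get(lead_id)

    def put(self, lead_id, data):
        self._rows[lead_id] = data


class LeadApi:
    """The only surface clients see; cache layers stay hidden behind it."""
    def __init__(self, backend):
        self._backend = backend
        self._cache = {}

    def get_lead(self, lead_id):
        if lead_id not in self._cache:       # cache layer is invisible to callers
            self._cache[lead_id] = self._backend.get(lead_id)
        return self._cache[lead_id]

    def save_lead(self, lead_id, data):
        self._backend.put(lead_id, data)
        self._cache.pop(lead_id, None)       # invalidate stale cache on write


api = LeadApi(DictBackend())
api.save_lead(42, {"email": "jane@example.com"})
print(api.get_lead(42)["email"])  # prints jane@example.com
```

Because `LeadApi` is the only thing consumers depend on, `DictBackend` could be traded for a sharded or entirely different store and the client code above would not change a line.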

What you expect to happen is not what will happen. While rebuilding other parts of the Lead storage system, my team also built a RESTful API to modify and retrieve lead data. We thought it would receive relatively moderate usage internally and very light usage externally; a pair of load-balanced Tomcat servers could easily handle the 10,000 requests per day we were expecting. Instead we saw internal usage spike well past 100,000 requests in the first few weeks, forcing us to modify and add capacity to the “scalable” system in a variety of ways. Now, over a year later, we have over 200 HubSpot customers and dozens of internal consumers tallying over 1 million requests and serving up a couple of gigabytes of data per day. This isn’t what we had expected for this API, and as a result it has a lot of rough edges, some of which have been sanded down, others of which still pose hazards to internal and external users.

Over-engineering vs. under-delivering. If your system takes a year to design and another year to build a first version, and it’s “infinitely” scalable, your business had better be capable of making that investment and continuing to run in the meantime. If you’re like just about every company smaller than Google or Microsoft, your development team can’t wait that long. Google’s BigTable is an amazing work of computer science and software engineering, but I suspect few established companies, and even fewer startups, could sustain the rest of their business while waiting on a team to deliver something like it. The other side of the coin is under-delivering. Designing for your current volume of traffic and requests is a sure way to waste your time. At the very least, plan for double, if not an order of magnitude more, requests/traffic/data than your current system sustains. If you’re building something entirely greenfield, consider what would happen if you opened up access to this data through an API layer to external developers. Would you see 10 requests per day? 100? 100,000? Think big enough that you won’t be up until 3am every night patching your system in a desperate attempt to keep it efficient and high-performing as usage grows.
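The back-of-envelope math here is worth making explicit. A sketch, with made-up multipliers (the 10x growth factor and 3x peak-to-average ratio are assumptions for illustration, not rules):

```python
# Capacity planning sketch: take today's daily volume, apply a growth
# multiplier, and see what peak requests/sec the new system must absorb.
# growth_multiplier=10 and peak_ratio=3 are illustrative assumptions.

def design_target_rps(daily_requests, growth_multiplier=10, peak_ratio=3):
    """Peak requests/sec a new system should comfortably handle."""
    avg_rps = daily_requests * growth_multiplier / 86_400  # seconds per day
    return avg_rps * peak_ratio

# Planning around the 10,000 requests/day we originally expected:
print(round(design_target_rps(10_000), 2))  # prints 3.47
```

Even an order of magnitude of growth on 10,000 requests/day is only a few requests per second at peak, which is why the real danger is usually the usage pattern you didn’t anticipate, not raw volume alone.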

Experiment to solve problems, not for the sake of experimenting. There are so many cool new NoSQL and key-value data storage projects out now that want to replace your relational database, and in the world of data-driven web applications there are plenty of uses for this new technology. That doesn’t mean every project is well suited to the model, though. Building an in-browser IM client? A key-value store sounds like a great way to store messages for quick retrieval. Building a white-label inventory management console? It could still work OK, but you’re in a gray area. That new CRM for non-profit dog-walking businesses? Now you’re sounding like you’re in the relational model’s sweet spot. Shiny new things are shiny, and new, but many successful technologies are successful because they’ve been proven to work very well for a wide range of applications.
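To make the IM example concrete: messages are only ever fetched by conversation, so a single key lookup returns everything a client needs, with no joins or secondary indexes. A minimal sketch, using a plain dict as a stand-in for a real key-value store, with illustrative names throughout:

```python
# Why the key-value model fits the IM case: the access pattern is
# "give me everything for this conversation id" -- one key, one lookup.
from collections import defaultdict

class MessageStore:
    """Dict-backed stand-in for a key-value store keyed by conversation id."""
    def __init__(self):
        self._conversations = defaultdict(list)

    def append(self, conversation_id, sender, text):
        self._conversations[conversation_id].append((sender, text))

    def history(self, conversation_id):
        # A single key lookup returns the whole conversation in order.
        return list(self._conversations[conversation_id])


store = MessageStore()
store.append("conv-1", "alice", "hey, you around?")
store.append("conv-1", "bob", "yep, what's up?")
print(len(store.history("conv-1")))  # prints 2
```

Contrast that with the CRM case, where you constantly ask cross-cutting questions (“all contacts at companies in this region with open deals”), and the relational model’s joins start earning their keep.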

This is by no means the blueprint for successful data system design; it’s just the start of some guidelines I intend to use as I build new, bigger systems to replace the ones I’ve already built. We’re working on some awesome new stuff at HubSpot and I can’t wait to post more about what we build. I’m sure we’ll try some new things that won’t fit, either because we don’t sufficiently understand them or because they’re ill-suited to our data model. I’ll try to keep this blog updated as we come across cool new findings (of which I expect there will be many).
