Service Decomposition at Scale
The problem: QuickBooks Self-Employed has seen some massive growth over the last two years. We’ve gone from 1 subscriber in mid-2014 to over 390,000 subscribers today. With this user growth, we’ve seen some of our MySQL tables grow really quickly as well. In this particular case, the tables related to our mileage tracking feature have grown to over 100 million records. As we saw during our peak season last year, this led to some fairly large I/O growth on our MySQL servers. After peak season, we decided that it was time to have the mileage tracking data say goodbye to our MySQL servers 👋.
First steps: So, we were ditching MySQL for a few tables, but what would replace it? We had a few important things we wanted out of our new database:
- It needed to scale to handle our future needs. We needed a database solution that was better suited to the transactional data we were dealing with.
- It needed to be resilient. If possible, we wanted to avoid having a single point of failure in our stack. Initially, we needed to be able to tolerate an AWS AZ outage, with the ability to change that to regional outage tolerance later on.
After some investigation into various databases, we decided to use the DataStax Enterprise (DSE) platform. DSE combines Apache Cassandra for data storage with Apache Solr for searching/querying.
The implementation:
Once we decided on Cassandra, we got up and running in no time. By using our internal AWS “Blaster appliance” (an automated tool for setting up a DSE Cassandra ring), along with some open source Spring Boot components and some internal SDKs, we were able to stand up our new service pretty quickly. We were able to create new trips without issue. Great! Ready to ship! …… except for those 100 million records that still needed to be migrated 😧
Data migration is never easy. Throw in some legacy data bugs, various timezone issues, and a change from long-based IDs to UUIDs — and it’s suddenly a lot harder than you anticipated. Since our data migration consisted of incremental CSV dumps from our MySQL database into our Cassandra ring, we didn’t have an easy way to validate that all of our data was correct in both places. To make sure we migrated all of our data with 100% accuracy, we built a data verifier into our product to ensure that the migration (and all of the code that maps from our new API to the old one) was correct. This verifier basically took the API objects for trips from the MySQL stack and compared them to the API objects that were mapped from the new Cassandra stack.
The verifier quickly turned out to be worth the time invested — we caught at least 5 or 6 critical migration bugs in our first few runs. It’s given us the confidence that we need to say that the data was migrated successfully onto the new stack.
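As a rough illustration of the idea behind the verifier, here is a minimal sketch in Python. It is not our production code: the `TripView` fields, the `diff_trips` function, and the shared `trip_key` used to pair records across the two stacks are all hypothetical names made up for this example. It assumes both stacks can be read back as API-level objects and that some key exists to match a legacy trip with its migrated copy.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass(frozen=True)
class TripView:
    """Hypothetical API-level view of a trip; field names are illustrative."""
    trip_key: str          # assumed key that pairs a legacy trip with its migrated copy
    start_utc: str         # ISO-8601 timestamp, assumed normalised to UTC
    distance_miles: float


def diff_trips(legacy: List[TripView], migrated: List[TripView]) -> Dict[str, List[str]]:
    """Return {trip_key: [problem, ...]} for every trip that does not match."""
    legacy_by_key = {t.trip_key: t for t in legacy}
    migrated_by_key = {t.trip_key: t for t in migrated}
    problems: Dict[str, List[str]] = {}

    # Walk the union of keys so trips missing on either side are reported too.
    for key in sorted(legacy_by_key.keys() | migrated_by_key.keys()):
        old = legacy_by_key.get(key)
        new = migrated_by_key.get(key)
        if old is None:
            problems[key] = ["missing from legacy stack"]
        elif new is None:
            problems[key] = ["missing from new stack"]
        else:
            old_fields, new_fields = asdict(old), asdict(new)
            mismatched = [
                f"{name}: {old_fields[name]!r} != {new_fields[name]!r}"
                for name in old_fields
                if old_fields[name] != new_fields[name]
            ]
            if mismatched:
                problems[key] = mismatched
    return problems
```

Comparing at the API-object level, rather than row by row, means the check also exercises the mapping code between the new API and the old one, which is where timezone and ID-conversion mistakes tend to hide.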
Besides data migration, the other main challenge we ran into was workflows where we had assumed (correctly, until this decomposition) that certain service calls were quick. We found a few cases where we would loop through a list of trips and save each trip one by one, which was (relatively) performant on a co-hosted MySQL server, but blew up when each save had a network hop associated with it. In one case, an API call that normally took around 100ms grew to 2 minutes. To fix these painful experiences, we changed all of these workflows to do a batch save, using 1 service call for the entire collection of trips.
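The difference between the two save patterns can be sketched as follows. This is a toy example, not our service code: `FakeTripService`, `save_trip`, and `save_trips` are hypothetical names, and the "service" is an in-memory stand-in that simply counts how many simulated network calls it receives.

```python
from typing import Dict, List


class FakeTripService:
    """In-memory stand-in for a remote trip service (hypothetical API).

    Each method call represents one network round trip.
    """

    def __init__(self) -> None:
        self.calls = 0
        self.stored: Dict[str, dict] = {}

    def save_trip(self, trip: dict) -> None:
        # One round trip per trip.
        self.calls += 1
        self.stored[trip["id"]] = trip

    def save_trips(self, trips: List[dict]) -> None:
        # One round trip for the whole collection.
        self.calls += 1
        for trip in trips:
            self.stored[trip["id"]] = trip


def save_one_by_one(service: FakeTripService, trips: List[dict]) -> None:
    """The original pattern: cheap against a co-hosted database, costly over a network."""
    for trip in trips:
        service.save_trip(trip)


def save_as_batch(service: FakeTripService, trips: List[dict]) -> None:
    """The batched pattern: a single service call regardless of collection size."""
    if trips:  # skip the call entirely when there is nothing to save
        service.save_trips(trips)
```

With 500 trips, the loop version makes 500 service calls while the batch version makes one, so the per-call network latency is paid once instead of once per trip.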
Results and lessons learned
We’re still working through the last steps of the migration, but we’re serving live production traffic now 🚀!
Looking back at the process, there are a few things we could have done better / earlier:
- Have a plan for data migration up front. When we first set out to design our new service, data migration was something of an afterthought. As we got into the details, it quickly became apparent that most of our time was going to be spent there. Looking forward, we’re going to make sure that we plan and build for data migration up front.
- Validate the database setup with experts early and often. As we got closer to shipping our service, a consultation with experts from DataStax uncovered some issues that forced us to do full rebuilds of our environments. While the rebuild itself is pretty much automated, it meant we had to run our lengthy migration jobs quite a few times.
- Build data-integrity checks into the product to validate data migration. Data migration is hard, and there will be issues.