QuickBooks Search Platform — Our Journey to AWS

Shivang Shah
QuickBooks Engineering
9 min read · Oct 10, 2018


“The journey is what brings us happiness, not the destination”

— Dan Millman, Way of the Peaceful Warrior

In all honesty, reaching our destination of being fully in the cloud made us just as happy as the journey itself, probably more so 😃. So what is this journey I so fondly talk about?

The Beginning

After a lot of experiments and “Follow Me Homes” with our customers, we launched QuickBooks Universal Search, with only one goal in mind — to improve our search functionality and power type-ahead search across the Product.

QuickBooks Universal Search in action

This was by no means a small feat — we had to design a scalable, distributed search platform that would not crumble under peak loads of incoming requests with every character the user typed (debounced, of course!). Nor should it require any major upgrades in the near future, because being deployed in our own datacenters limited how fast we could scale our hardware. So we had to prepare for the future: a future where we could support this platform while we journeyed towards AWS.

The Architecture

Now Pay Attention. This is where I dump all the gruesome, gory details about how complex our Application Architecture was and how intertwined our usage of technologies was just so we could scale. Sounds about right?

“Simplicity is the soul of Efficiency”

— Austin Freeman (in The Eye of Osiris)

No, actually it does not. Architectures are supposed to be as simple as possible. The simplicity of our architecture is what allowed us to maximize Developer Productivity, and with that came faster (and more frequent) releases.

Below is the high-level design of what our architecture looked like when we were in our own datacenters (from here on, we will refer to them as IHP — the Intuit Hosting Platform — consisting of 2 datacenters).

Universal Search Architecture in IHP

At any point in time, QuickBooks would be active in one datacenter, publishing events on the message bus. The messaging cluster would forward the messages to 2 queues, one for the search cluster in each datacenter. Point to note — we had to do this because Elasticsearch did not provide cross-datacenter replication out of the box. We basically used an eventing architecture to feed both our search clusters in real time, making us capable of Active-Active DR strategies. Pretty standard stuff.

Elasticsearch was our weapon of choice to power type-ahead search queries. This Lucene-backed search engine is quite powerful and is really the de facto industry standard for enterprise search. Plus, it scales quite well if proper design patterns are implemented based on the nature of the application (Caution: I cannot stress this enough — do your due diligence upfront on how you plan to index, store & query your data, because as your business grows, your original designs will (to some extent) determine how well you scale. If you don’t plan well, you will keep reindexing your data and rebalancing your cluster all day long!). Without going too deep into Elasticsearch design, I would just like to share a brief overview of how the data is organized into indices and shards. This is important to discuss because it is what allowed us to operate at terabyte scale with sub-second query latencies.

Our Elasticsearch cluster has 100 indices following the naming convention “search-XX”, where XX is just a representation of our bucketing strategy. For example, the company with realmId 12343564 belongs to index “search-35”. Because our queries are always constrained to a single company, there is never a cross-index query. This alone gave us a huge boost in performance.
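To make the bucketing idea concrete, here is a purely hypothetical sketch. The actual bucketing function we use is not covered in this post; a hash-modulo-100 mapping is just one way it could be done.

```java
// Hypothetical illustration only: the real bucketing function may differ.
public final class SearchIndexBucketer {

    private static final int BUCKET_COUNT = 100;

    /** Maps a company's realmId to one of the 100 "search-XX" indices. */
    public static String indexFor(String realmId) {
        // Math.floorMod keeps the bucket non-negative even if hashCode() is negative.
        int bucket = Math.floorMod(realmId.hashCode(), BUCKET_COUNT);
        return String.format("search-%02d", bucket);
    }

    public static void main(String[] args) {
        // Every realmId deterministically maps to exactly one index,
        // so a company's documents (and its queries) never span indices.
        System.out.println(indexFor("12343564"));
    }
}
```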

Each index has 15 primary shards with 1 replica. To further tune and boost performance, we used Elasticsearch routing to index all documents belonging to a company into a specific shard within an index, using realmId as the routing value. Since our queries are strictly bounded to the context of a single company, every query for a given company always routes to the same shard, further enhancing our performance.
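Here is a minimal sketch of what routed indexing and querying look like with the Elasticsearch high-level REST client, using 6.2-era method signatures. The host, document fields, and query text are illustrative placeholders, not our production code.

```java
import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RoutedSearchSketch {

    public static void main(String[] args) throws IOException {
        String realmId = "12343564";
        String index = "search-35"; // index picked by the bucketing strategy above

        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) { // placeholder host

            // Index a (made-up) transaction document, routed by realmId so every
            // document for this company lands on the same shard of the index.
            IndexRequest indexRequest = new IndexRequest(index, "_doc", "txn-1")
                    .routing(realmId)
                    .source("{\"realmId\":\"12343564\",\"name\":\"Acme invoice\"}",
                            XContentType.JSON);
            client.index(indexRequest); // newer client versions also take a RequestOptions argument

            // Query with the same routing value: only the one shard that holds this
            // company's documents is searched, instead of all 15 primaries.
            SearchRequest searchRequest = new SearchRequest(index)
                    .routing(realmId)
                    .source(new SearchSourceBuilder()
                            .query(QueryBuilders.multiMatchQuery("acme", "name", "memo")));
            SearchResponse response = client.search(searchRequest);
            System.out.println("hits: " + response.getHits().getTotalHits());
        }
    }
}
```

The key point is simply that the same realmId is passed as the routing value on both the write path and the read path, so both end up on the same shard.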

“Your most unhappy customers are your greatest source of learning”

— Bill Gates

What’s the takeaway from this? Know Your Customers, their needs and their business! This is where “Follow Me Homes” with our customers really paid off. We spent some serious time learning how our customers search their data, what specific data they really care about, and what kind of search capabilities they expect at their fingertips from a modern SaaS product such as QuickBooks. Technologies are flexible enough to tune at will. It all boils down to how you tune them to your customers’ delight!

Talking about scale — with this simple architecture in place, we currently ingest ~50 million business transactions and handle ~2 million queries per day.

Our application topology is quite simple too. We have 10 stateless application servers fronted by a Load Balancer (round-robin routing), which in turn is fronted by a Gateway. The application servers connect to the Elasticsearch cluster to feed in and query out the data.

Decisions Decisions …

When we evaluated moving to the cloud, a lot of thought was put into how we should arrive at a decision. Here are the guiding principles that we agreed on to determine our choice.

  • Migration Simplicity
  • Security
  • We’d rather not manage the cluster if it can be done for us (upgrades, scaling, etc.)
  • Maximize Developer Productivity
  • Blazing fast release cycles

AWS Elasticsearch Service came out on top — that whole “we will securely manage it for you” vibe was a perfect sell. We as a team would rather concentrate on solving customer problems. Not to mention, all of the other solutions that AWS provides to manage our application lifecycle were just a cherry on top. The whole AWS ecosystem is quite a well-oiled machine, and there was no reason for us to shy away from it; we chose to embrace it all. That being said, we did have to make a trade-off on one of our guiding principles — Migration Simplicity.

IHP → AWS

In the 6 months we had been supporting our use cases from those clusters, there were major upgrades to Elasticsearch, and as you can imagine, with a major version upgrade (6.x.x) came major breaking changes. Some of these breaking changes were important enough to address.

To be fair, we had known this was coming for some time. So even though 5.x.x supported multiple mapping types, and we did use them heavily, the actual mappings were kept flexible enough for us to easily port to the new world (6.x.x). But a change in mappings means fully re-indexing your data. We have close to 10 TB of data with about 8 billion documents; re-indexing wasn’t going to be easy. It needed to be thought through.
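For a feel of the 6.x world, here is a minimal sketch of creating a new index with a single mapping type and the same 15-shard / 1-replica layout, via the low-level REST client. The index name and field names are placeholders, and the Request helper assumes a reasonably recent client release (older low-level clients expose performRequest overloads that take the method and endpoint directly).

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CreateNewIndexSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) { // placeholder host

            // One mapping type only ("_doc"), as required from 6.x onwards.
            // The fields below are placeholders, not our real mappings.
            Request request = new Request("PUT", "/search-new-35");
            request.setJsonEntity(
                "{"
              + "  \"settings\": { \"number_of_shards\": 15, \"number_of_replicas\": 1 },"
              + "  \"mappings\": {"
              + "    \"_doc\": {"
              + "      \"properties\": {"
              + "        \"realmId\": { \"type\": \"keyword\" },"
              + "        \"name\":    { \"type\": \"text\" }"
              + "      }"
              + "    }"
              + "  }"
              + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```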

Another major hurdle for us to overcome was moving away from the Elasticsearch Transport Client and embracing the Elasticsearch Java REST Client. This was partly because with AWS Elasticsearch Service we cannot use the Transport Client: AWS only exposes HTTP(S) endpoints and does not support the native TCP transport protocol (which is what the Transport Client requires).
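The switch itself is mostly a matter of constructing the REST client against the domain’s HTTPS endpoint. A minimal sketch, with a made-up endpoint and leaving out the request signing / access-policy setup that a real AWS Elasticsearch Service domain needs:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public final class SearchClientFactory {

    // Placeholder endpoint: AWS Elasticsearch Service exposes an HTTPS endpoint,
    // not the native TCP transport protocol the old TransportClient relied on.
    private static final String AWS_ES_ENDPOINT =
            "search-universal-search-abc123.us-west-2.es.amazonaws.com";

    public static RestHighLevelClient newClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost(AWS_ES_ENDPOINT, 443, "https")));
    }
}
```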

Improvise. Adapt. Overcome.

The team had been going back and forth on how we could support Search in IHP and Search in AWS while keeping the same codebase. On one hand, we didn’t want to upgrade our IHP infrastructure to 6.x, considering it would be a waste of time. On the other hand, we didn’t want to move to AWS on 5.x, which was already an older version that we would have to upgrade anyway. Whichever route we chose, there would be code changes regardless. We needed to choose the path of least resistance.

I distinctly remember a phone call with our team lead one fine evening while crawling through peak Bay Area traffic. After discussing it for way too long, he said (and I am paraphrasing) — “Let’s just do it, man. Let’s keep it simple.”

The essence of the discussion was quite simple, yet beautiful. We would deprecate what we had in IHP and add no new features. At the same time, we would port all the code over to a new repository, experiment with some new CI/CD frameworks, and accelerate our journey to Modern SaaS (more in another blog). We even simplified our data migration strategy. After many discussions with AWS Solution Architects and working through the problem, we realized that we didn’t want to invest in writing one-off tools (AWS Lambdas, Kinesis Firehose); we improvised yet again and kept things simple, using only the tools we already had at our disposal. Two things were certain —
1) We needed new indices in AWS anyway, because with 6.2 we cannot use multiple mapping types.
2) Because these would be new indices with new mappings, we would have to re-index the data.
So here’s how we planned our data migration:

Universal Search Data Migration Strategy
  • We will deploy a fresh new Elasticsearch Cluster in AWS with 6.2. We will also deploy application code compatible with Elasticsearch 6.2.
  • We will open the floodgates of active incoming messages and start indexing new data into new indices (search-new-XX).
  • In the meantime, we will back up the latest data snapshot from IHP to S3, covering all the old indices (search-XX).
  • Thanks to the ability to restore snapshots from S3 (and the fact that 5.3 snapshots are restore-compatible with a 6.2 cluster), all the old indices (search-XX) are then restored into the new AWS Elasticsearch Service cluster.
  • The best part? The old and new indices now co-exist in the same cluster! With this in place, we just use the Reindex API to reindex all the documents from the old indices (search-XX) into the new indices (search-new-XX). We also set the version type to external so that newer versions of documents (indexed through messaging during the reindexing window) are not overwritten. A sketch of this reindex call follows below.
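Here is a minimal sketch of that reindex call for a single index pair, again via the low-level REST client. The index names are examples, the snapshot restore from S3 is assumed to have already completed, and the Request helper assumes a reasonably recent client release.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReindexSketch {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost(
                "search-universal-search-abc123.us-west-2.es.amazonaws.com", // placeholder endpoint
                443, "https")).build()) {

            // Copy everything from the restored old index into the new one.
            // "version_type": "external" means a document is only written if its
            // version is newer than what is already in the destination, so fresh
            // documents coming in from the message bus are never overwritten.
            // "conflicts": "proceed" skips those already-newer documents instead
            // of aborting the whole reindex job.
            Request reindex = new Request("POST", "/_reindex");
            reindex.setJsonEntity(
                "{"
              + "  \"conflicts\": \"proceed\","
              + "  \"source\": { \"index\": \"search-35\" },"
              + "  \"dest\":   { \"index\": \"search-new-35\", \"version_type\": \"external\" }"
              + "}");
            Response response = client.performRequest(reindex);
            System.out.println(response.getStatusLine());
        }
    }
}
```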

BOOM!

And just like that, Universal Search in AWS was open for business. And now it was time for experimentation & testing. We excelled at all the possible OpMechs in our new DevOps world. We have all-round monitoring, alerting, CI/CD and testing automation (one more blog for another time). Our check-in → production window has been reduced to under 1 hour, with all the right guardrails. I’d continue on about how we marveled at our own magnificence, but I guess it’s time to move on 😏.

What’s Next

We aren’t done yet. Being deployed in AWS has opened up a lot more options for us in terms of scalability, and we would like to seize this opportunity to experiment more and fail fast if need be. We will continue to monitor Voice of Customers and User Voice channels for any feedback on how to improve the quality of Universal Search.

Final Thoughts

Just like all other products, Universal Search had its ebb and flow. But the team’s Customer Obsessed culture, backed by an intense desire to learn, simplify and innovate, is what really made it successful. I am extremely proud of all the team members who poured their hearts into getting this feature into Customers’ hands. It was one of my most memorable journeys of delivering a widely used feature with such an Awesome Team. Kudos & Thank you!

We have achieved a huge milestone, but this is just the beginning. Stay tuned for more!

Acknowledgements

The QBO Search team in the Small Business unit was instrumental in the success of this journey. Special thanks to Akbar Rangara, who reviewed and added content to this post.
