We Have Completed The Biggest Infrastructure Update Ever!
The overall goal of this undertaking was to lower network latency.
Our engine is extremely fast, serving search queries in less than 15ms on average, even on larger stores with hundreds of thousands of SKUs (excluding the color engine). That number doesn’t leave much room for further improvement.
Contrast that with the time required for a network packet to travel from the East Coast to the West Coast and back: about 65ms. A round trip from LA to Sao Paulo, Brazil takes about 165ms. Compare that with the 15ms required to serve a request.
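To put those numbers in perspective, here is a back-of-the-envelope TypeScript sketch that estimates the theoretical floor for those round trips, assuming light travels through fiber at roughly two-thirds of its vacuum speed and using rough route distances (the distances are assumptions, not measurements):

```typescript
// Back-of-the-envelope round-trip estimate. Light in fiber travels at roughly
// two thirds of c, and real routes are longer than the straight line, so these
// figures are optimistic lower bounds, not measurements.
const SPEED_OF_LIGHT_KM_S = 299_792; // km/s in vacuum
const FIBER_FACTOR = 2 / 3;          // approximate propagation speed in optical fiber

// Approximate one-way route distances in km (assumed for illustration)
const routesKm: Record<string, number> = {
  "New York -> Los Angeles": 4_500,
  "Los Angeles -> Sao Paulo": 10_000,
};

for (const [route, km] of Object.entries(routesKm)) {
  const oneWayMs = (km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)) * 1000;
  console.log(`${route}: theoretical RTT >= ${(2 * oneWayMs).toFixed(0)} ms`);
}
// Even this theoretical floor dwarfs the ~15ms spent serving the query itself.
```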
To fight the speed of light, we have had servers deployed in different geographic locations for some time. Along the way we tried a few things that didn’t prove very effective: GeoDNS services such as Azure Traffic Manager and Amazon’s geolocation routing policy, and a third-party anycasted private cloud.
The latest iteration started almost a year ago, in September 2018, when we applied for our own IPv4 address block and ASN at ARIN. The process took almost 10 months to complete, and we received our assigned block in July.
You can check it from a console: whois 64.4.174.4
This finally allowed us to start building our own Anycast network, with our own routing policies and the freedom to choose data centers wherever we want, or, more precisely, wherever the customers of the stores running on our platform are.
Anycast is a network addressing and routing methodology in which a single destination address has multiple routing paths to two or more endpoint destinations. Routers select the path based on the number of hops, distance, cost, latency measurements, or congestion. Anycast networks are widely used by content delivery network (CDN) products to bring their content closer to the end user.
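The routing itself happens at the BGP level, outside of application code, but a simple way to see anycast in action is to measure the TCP handshake time to the anycast address from different locations: the same IP answers everywhere, yet the round trip reflects whichever POP you were routed to. A minimal Node.js/TypeScript sketch, assuming the service listens on port 443:

```typescript
import * as net from "net";

// Measure the TCP handshake time to an anycast address. The same IP is
// announced from every POP, so the handshake RTT reflects whichever POP
// BGP routed this particular client to.
function tcpHandshakeMs(host: string, port = 443): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    const socket = net.connect({ host, port }, () => {
      const elapsedNs = process.hrtime.bigint() - start;
      socket.end();
      resolve(Number(elapsedNs) / 1e6); // ns -> ms
    });
    socket.on("error", reject);
    socket.setTimeout(5000, () => {
      socket.destroy();
      reject(new Error("connection timed out"));
    });
  });
}

// Run from different locations, the same address yields very different numbers.
tcpHandshakeMs("64.4.174.4").then((ms) =>
  console.log(`TCP handshake to anycast address: ${ms.toFixed(1)} ms`)
);
```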
As we already had a distributed search cloud, switching it from GeoDNS to Anycast was fairly simple. Interesting things started to unfold after that.
1. As our network is no longer tied to a single hosting provider, we have more freedom in choosing where to locate our servers. A couple of things followed from this: we added a new server in Sao Paulo to better serve our South American clients. We moved one server from San Jose to Los Angeles, as our analytics show that our LA data center has better network connectivity and receives more traffic. And finally, we deployed a server in Dallas, TX. This leaves us with POPs in five regions: New Jersey, Virginia, California, Texas, and Sao Paulo.
2. We improved our analytics tool, so we now capture the network round-trip time from the end customer to our search service and back. This gives us insight into our overall network performance and points to places for improvement. As a result, we’ve come up with a methodology for choosing the data centers that host our servers. We are going to repeat it from time to time; our current plan is to run the experiment every 3 or 4 months.
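Our collection pipeline is internal, but the core idea can be sketched with the browser’s Resource Timing API: the TCP handshake of a request is a pure network round trip, and time-to-first-byte adds the server’s processing on top, so reporting both separates the network’s share from ours. The endpoint names below are placeholders, not our actual API:

```typescript
// Browser-side sketch: estimate the network round trip for a search request
// from Resource Timing data and beacon it to an analytics collector.
// Both "/search" and "/analytics/rtt" are hypothetical paths.
function reportSearchRtt(): void {
  const entries = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
  for (const e of entries) {
    if (!e.name.includes("/search")) continue; // only look at search API calls

    // TCP handshake = one full network round trip (0 when a connection is reused)
    const handshakeMs = e.connectEnd - e.connectStart;
    // Time to first byte = network round trip + server processing time
    const ttfbMs = e.responseStart - e.requestStart;

    navigator.sendBeacon(
      "/analytics/rtt",
      JSON.stringify({ url: e.name, handshakeMs, ttfbMs })
    );
  }
}

window.addEventListener("load", reportSearchRtt);
```

Note that cross-origin resources only expose these timings when the response carries a Timing-Allow-Origin header.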
3. As we re-deployed the servers, we also took the opportunity to upgrade the hardware. The majority now run on 3.8GHz CPUs instead of 2.4GHz, with better SSDs. This means search queries are served faster.
As far as software is concerned, we also implemented a couple of updates to lower latency.
4. We moved from gzip compression to Brotli compression for static content, which saved around 28% of network transfer time. That’s because Brotli gives a better compression ratio (a quick way to reproduce the comparison is sketched after this list):
· JavaScript files compressed with Brotli are 14% smaller than with gzip.
· HTML files are 21% smaller.
· CSS files are 17% smaller.
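The exact savings depend on the asset mix, but the comparison is easy to reproduce with Node’s built-in zlib module, which ships both codecs. A quick sketch (not our build pipeline; the file path is a placeholder):

```typescript
import { readFileSync } from "fs";
import { gzipSync, brotliCompressSync, constants } from "zlib";

// Compare gzip and Brotli output sizes for a single static asset.
const file = process.argv[2] ?? "app.js"; // placeholder default
const data = readFileSync(file);

const gz = gzipSync(data, { level: 9 });
const br = brotliCompressSync(data, {
  params: { [constants.BROTLI_PARAM_QUALITY]: 11 }, // max quality, fine for static assets
});

const saving = (1 - br.length / gz.length) * 100;
console.log(`${file}: original ${data.length} B, gzip ${gz.length} B, brotli ${br.length} B`);
console.log(`Brotli output is ${saving.toFixed(1)}% smaller than gzip`);
```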
5. We re-configured our front-end servers to support HTTP/2.
HTTP/2 (originally named HTTP/2.0) is a major revision of the HTTP network protocol used by the World Wide Web. The primary goals for HTTP/2 are to reduce latency by enabling full request and response multiplexing, minimize protocol overhead via efficient compression of HTTP header fields, and add support for request prioritization and server push.
This also improves overall speed and responsiveness.
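Our front-end servers are not Node.js, but as an illustration, this is roughly what serving over HTTP/2 looks like with Node’s built-in http2 module. Browsers only negotiate HTTP/2 over TLS, so a certificate is required (the file paths are placeholders):

```typescript
import { createSecureServer } from "http2";
import { readFileSync } from "fs";

// Minimal HTTP/2 server. Browsers negotiate HTTP/2 via ALPN during the TLS
// handshake, hence the certificate/key pair (placeholder paths).
const server = createSecureServer({
  key: readFileSync("server.key"),
  cert: readFileSync("server.crt"),
  allowHTTP1: true, // fall back to HTTP/1.1 for clients without HTTP/2
});

server.on("stream", (stream, headers) => {
  // Each request arrives as a stream multiplexed over a single connection,
  // so many requests no longer queue behind one another.
  stream.respond({ ":status": 200, "content-type": "application/json" });
  stream.end(JSON.stringify({ path: headers[":path"] }));
});

server.listen(8443, () => console.log("HTTP/2 server listening on :8443"));
```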
Why not just use Cloudflare?
Since we serve a great deal of dynamic content, a solution like Cloudflare, which caches information, won’t let us achieve the level of speed we want. The amount of dynamic content is extensive because we try to personalize each customer’s journey, so every query is answered with different search results.
If we used Cloudflare servers for static content, we’d need two domain names: one for static content and one for dynamic. That would lead to a more complex configuration with no substantial benefit.
What’s next?
We are keeping an eye on HTTP/3, waiting for it to be supported in mainstream software.
We found that mobile traffic differs dramatically from landline traffic in performance and characteristics. We also found that, depending on the originating ISP, search requests may be routed to a sub-optimal location. First, we need to enhance our reporting tools to understand which ISPs give us the most trouble. Then we are going to re-run our data center selection methodology from time to time.