Scaling Subdomains with Redis and ZeroMQ

You know those nice web apps that allow customised subdomains per customer, e.g. mycompany.webapp.com? Nice to have, sure, but as I discovered, keeping this feature performant in a scalable server farm is non-trivial.

To determine which customer organization a web request belongs to, you could call the service layer to look up the organization ID for the subdomain, but that is massively inefficient – a guaranteed database round-trip on every web request. The answer is a cache server such as Redis or memcached. Here is the configuration I use:

[Diagram: the web tier – a load balancer in front of multiple web servers, each running the web app plus its own local Redis cache instance]

You might wonder why a Redis instance per web server instead of just having one shared Redis instance. There are two main reasons for this:

  1. Since this is a per-request cache fetch I want it to be as fast as possible to increase throughput, and accessing Redis on the same server is faster than a call over the network.
  2. With one Redis instance per server you are more likely to get a cache hit, minimizing the need to go to the database. Note that configuration is important for this to work as intended: use one of the LRU options for the Redis maxmemory-policy, and configure the load balancer to use the Source IP algorithm or better (a sample follows this list).
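
As a rough illustration, the relevant redis.conf settings might look like this (the memory limit is just an example value; size it for your own deployment):

# Cap the cache size and evict least-recently-used keys once the limit is hit
maxmemory 256mb
maxmemory-policy allkeys-lru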

The web app on each server is configured to handle the wildcard domain (*.webapp.com), and a request handler queries the subdomain from the Host header and checks whether there is a cache entry for it. If there is no cache entry, a call to the service layer/database can be made and the result stored in the cache. Once it is available, the web server can use this data for whatever it likes. A common example is different or custom UI themes per organization, where the web server returns different HTML or CSS links depending on the customer organization. With the necessary details cached locally, it can do its UI rendering tasks without having to call the service layer.
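
As a sketch of that handler logic, assuming the StackExchange.Redis client (IOrganizationService, GetOrganizationBySubdomain and Organization.Deserialize are hypothetical stand-ins for your own service layer and serialization code):

using System;
using StackExchange.Redis;

public class SubdomainResolver
{
    //One multiplexer per process, pointing at the Redis instance on this web server
    private static readonly ConnectionMultiplexer _redis =
        ConnectionMultiplexer.Connect("localhost");

    private readonly IOrganizationService _service; //Hypothetical service layer client

    public SubdomainResolver(IOrganizationService service)
    {
        _service = service;
    }

    public Organization ResolveOrganization(Uri requestUrl)
    {
        //"mycompany.webapp.com" -> "mycompany"
        string subdomain = requestUrl.Host.Split('.')[0];
        IDatabase cache = _redis.GetDatabase();

        string cached = cache.StringGet("org:" + subdomain);
        if (cached != null)
            return Organization.Deserialize(cached); //Cache hit - no service layer call

        //Cache miss: one service layer/database round-trip, then store the result
        Organization org = _service.GetOrganizationBySubdomain(subdomain);
        cache.StringSet("org:" + subdomain, org.Serialize());
        return org;
    }
}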

Sounds good, right? Yes – except for one tiny problem. Any time a cache is introduced to an architecture it raises the issue of keeping the cache in sync with the authoritative data store (i.e. the database). Put simply, if a customer changes the theme they want to use and saves that change, the web server cache does not yet reflect it and so will continue to deliver the old theme until the customer data is evicted from the Redis cache. The customer, having changed theme, will probably keep making requests, refreshing the page, wondering why the theme hasn’t changed – because each request is keeping the old data alive in the cache! The problem is magnified because a stale cache entry for the customer could now exist on more than one Redis instance – the downside of using multiple Redis instances.

But there is a solution. This is the kind of problem messaging is great at solving. Enter ZeroMQ.

The task of saving the customer record update is done by the service (or data) layer, so it has the responsibility of informing any component that needs to know that the customer data is now potentially out-of-date. But how does the service layer know which components to send the message to? (There could be multiple web servers on-line caching the data.) The answer is that the service layer does not need to know. It’s not its responsibility. It just needs to send the message – fire and forget.

ZeroMQ is a lightweight messaging option. We could use something like RabbitMQ, which can be configured for guaranteed end-to-end messaging, but if the message being sent isn’t mission critical you can decide to trade reliability for performance – and ZeroMQ is blazingly fast. MSMQ is also slower, and configuring and testing it is a bit more of a pain than using a lightweight, embedded messaging component like ZeroMQ.

To handle the notification to multiple web servers I use the pub-sub messaging model. Basically, one web server instance (the primary server) is set up as the messaging hub. Yes, it is a single point of failure, but again, these messages aren’t mission critical. You could use a more elaborate message broker set-up with redundancy and message storage, but that means trading away performance. Let’s look at the ZeroMQ pub-sub implementation in practice.

[Diagram: the pub-sub message flow – service tier publishers connect to the proxy on the primary web server, which forwards change messages to the subscribing web servers]

We’ll use the Pub-Sub Proxy Pattern to handle the registration of web servers and the forwarding of messages to them. As a web server comes on-line, it registers as a message subscriber with the XPUB socket on the primary web server (which is configured to listen). When a service tier server publishes a change message, the NetMQ proxy (or hub) forwards the message to all subscribers. Each subscriber simply checks the contents of the message to see if the customer ID is one it is holding in its Redis cache. If so, it refreshes the entry immediately.

ZeroMQ itself is a native C/C++ library, so you’ve got (at present) two choices for using it in .NET. You can use clrzmq, which is a managed DLL wrapper around the unmanaged ZeroMQ library, or you can use NetMQ, which is a native C# implementation of the ZeroMQ functionality. At the time of writing NetMQ is not yet considered production ready, so it’s your call which to use – managed code that is not production ready but easier to debug, or native code that is harder to debug and potentially open to memory leaks.

Thankfully, NetMQ has a ready-built implementation of the Proxy pattern for us.

Here is a sample of the proxy code. Typically this would be run as a separate process or service on the primary web server, or you could run it as a Task or Thread in the main web app (though there are start-up/shutdown issues involved which I won’t go into here).

private void MessagingProxyTaskFunc()
{
    //Use the common context to create the mq sockets - created earlier and stored on the AppDomain
    NetMQContext cxt = (NetMQContext)AppDomain.CurrentDomain.GetData("WB_NetMQContext");

    using (NetMQSocket frontend = cxt.CreateXSubscriberSocket())
    using (NetMQSocket backend = cxt.CreateXPublisherSocket())
    {
        frontend.Bind("tcp://*:9100"); //Service tier publishers connect here, port 9100
        backend.Bind("tcp://*:9101");  //Web server subscribers connect here, port 9101, listening for forwarded messages

        //Create the proxy. It forwards published messages from frontend to backend,
        //and subscription requests from backend to frontend, so no manual
        //receive/send loop is needed
        NetMQ.Proxy proxy = new NetMQ.Proxy(frontend, backend, null);

        //Stop the proxy when cancellation is requested so Start() can return
        taskCancelToken.Register(proxy.Stop);

        proxy.Start(); //Blocks here, forwarding messages, until Stop() is called
    }
}
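
For context, here is one way the proxy task could be launched at application start-up. This is a sketch – the field names and start-up hook are illustrative rather than taken from a real app:

using System;
using System.Threading;
using System.Threading.Tasks;
using NetMQ;

//Hypothetical start-up wiring on the primary web server
private readonly CancellationTokenSource cancelSource = new CancellationTokenSource();
private CancellationToken taskCancelToken;

public void StartMessagingProxy()
{
    //Create the shared context once and store it where MessagingProxyTaskFunc expects it
    NetMQContext cxt = NetMQContext.Create();
    AppDomain.CurrentDomain.SetData("WB_NetMQContext", cxt);

    taskCancelToken = cancelSource.Token;
    Task.Factory.StartNew(MessagingProxyTaskFunc, TaskCreationOptions.LongRunning);
}

//On shutdown, calling cancelSource.Cancel() stops the proxy and lets the task exit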

Next we need the business service to publish the message when the customer data changes:

public Organization SaveOrganization(Organization org)
{
    //Do data store logic here

    ...

    if (hasOrgSubdomainChanged)
    {
        //Get the publisher socket, created when the business service was created using:
        //NetMQSocket socket = cxt.CreatePublisherSocket();
        //socket.Connect("tcp://<Primary Web Server IP Address>:9100");

        NetMQSocket socket = (NetMQSocket)AppDomain.CurrentDomain.GetData("WB_PubSocket");

        //Three frames: the "ORG" topic header, the organization ID, and the data
        NetMQMessage msg = new NetMQMessage();
        msg.Append(new NetMQFrame(Encoding.UTF8.GetBytes("ORG")));
        msg.Append(new NetMQFrame(Encoding.UTF8.GetBytes(org.PublicID)));
        msg.Append(new NetMQFrame(Encoding.UTF8.GetBytes(org.Serialize())));
        socket.SendMessage(msg);
    }

    return org; //Return the saved organization, as the method signature requires
}
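
The set-up mentioned in the comment above would run once when the business service starts; roughly (keeping the same address placeholder):

//Hypothetical one-time set-up when the business service starts
NetMQContext cxt = NetMQContext.Create();
NetMQSocket pubSocket = cxt.CreatePublisherSocket();
pubSocket.Connect("tcp://<Primary Web Server IP Address>:9100"); //The proxy's XSUB port
AppDomain.CurrentDomain.SetData("WB_PubSocket", pubSocket);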

Finally, the code for the message listener on each individual web server. Again, this function needs to run as its own process/thread to avoid blocking and to ensure a timely response to messages:

private void MessagingTaskFunc()
{
    NetMQContext cxt = (NetMQContext)AppDomain.CurrentDomain.GetData("WB_NetMQContext");

    using (NetMQSocket socket = cxt.CreateSubscriberSocket())
    {
        socket.Subscribe(Encoding.UTF8.GetBytes("ORG")); //Subscriber only listens for the "ORG" message header
        socket.Connect("tcp://127.0.0.1:9101"); //The proxy's XPUB port (use the primary server's address from other servers)

        while (true)
        {
            if (taskCancelToken.IsCancellationRequested) break;

            try
            {
                //Blocks until a message is received; data is null if interrupted
                NetMQMessage data = socket.ReceiveMessage();

                if (data == null) break;

                data.Pop(); //Pop the first message frame - it will always be "ORG"

                //The next frame contains the ID of the organization, and the last frame the data
                NetMQFrame frame = data.Pop();
                string orgID = Encoding.UTF8.GetString(frame.Buffer);

                //Check that the organization's ID is one cached in Redis. If so, refresh
                //the Redis entry using the last message frame's data.

                ...
            }
            catch
            {
                //Handle the receive error gracefully - ensure the listener loop keeps running
            }
        }
    }
}
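
And to fill in the shape of that elided refresh step – a sketch, again assuming the StackExchange.Redis client; org.Subdomain and Organization.Deserialize are assumed members, and the key format matches the resolver sketch earlier:

//Hypothetical cache refresh using the remaining message frame
NetMQFrame dataFrame = data.Pop(); //Last frame: the serialized organization data
Organization org = Organization.Deserialize(Encoding.UTF8.GetString(dataFrame.Buffer));

IDatabase cache = _redis.GetDatabase(); //_redis: the process-wide ConnectionMultiplexer
string key = "org:" + org.Subdomain;    //org.Subdomain is an assumed property

//Only overwrite if this server is actually holding the entry in its cache
if (cache.KeyExists(key))
    cache.StringSet(key, org.Serialize());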

There you have it – an architecture for scalable, synchronized, custom subdomains.