CNK's Blog

Hostnames and Aliases

Our multitenant Wagtail setup has made a lot of things smoother compared to the multi-instance setup it replaced. But there are a couple of situations that were easier in the old setup. For example, how do we set up a new site while the existing site is still live? Or what do we do when an organization changes its name and wants its URL updated to the new acronym, but still wants the old one to work?

Site Aliases

Our solution to these problems is twofold. First, create all sites as subdomains of the installation’s hostname; that takes care of building a new site while the current one is still live. But then we need to be able to assign the real name to the new site. To take care of that, we have a SiteAlias model to associate additional names with a site.

So if the hostname for our install is sites.example.com, then we create new sites as xyz.sites.example.com, etc. We have a wildcard DNS mapping and a wildcard SSL certificate for *.sites.example.com, so once we create a site, it is available on the public internet as https://xyz.sites.example.com. When the customer is ready for this site to be live with their preferred names, e.g. xyz.example.com, we can add the new name to our SiteAliases, request a DNS mapping and a new SSL certificate. Then the site is live with both names. Below is the code we use for mapping requests to sites.

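    from django.http.request import split_domain_port
    from wagtail.models import Site

    # MissingHostException is defined elsewhere in this project.


    def match_site_to_request(request):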
        """
        Find the Site object responsible for responding to this HTTP request object. Try in this order:
        * unique hostname
        * unique site alias

        If there is no matching hostname or alias for any Site, Site.DoesNotExist is raised.

        This function returns a tuple of (match_type_string, Site), where match type can be 'hostname' or 'alias'.
        It also pre-selects as much as it can from the Site and Settings, to avoid needless separate queries for things
        that will be looked at on most requests.

        This function may throw either MissingHostException or Site.DoesNotExist. Callers must handle those appropriately.
        """
        query = Site.objects.select_related('settings', 'root_page', 'features')

        if 'HTTP_HOST' not in request.META:
            # If the HTTP_HOST header is missing, this is an improperly configured test; any on-spec HTTP client will include it.
            raise MissingHostException()

        # Get the hostname. Strip off any port that might have been specified, since this function doesn't need it.
        hostname = split_domain_port(request.get_host())[0]
        try:
            # Find a Site matching this specified hostname.
            return ('hostname', query.get(hostname=hostname))
        except Site.DoesNotExist:
            # No Site has this canonical hostname, so check whether 'hostname' matches an alias instead.
            # Site.DoesNotExist will be raised if 'hostname' doesn't match an alias either.
            return ('alias', query.get(settings__aliases__domain=hostname))
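To make the lookup order easier to see (and to unit test without a database), here is a minimal framework-free sketch of the same idea. The dictionaries, the `MissingSite` exception, and the `strip_port` and `match_site` helpers are hypothetical stand-ins for the Site and SiteAlias tables and for Django's port-splitting; they are not part of our actual code.

```python
class MissingSite(LookupError):
    """Hypothetical stand-in for Site.DoesNotExist."""


# Hypothetical in-memory data standing in for the Site and SiteAlias tables.
HOSTNAMES = {"xyz.sites.example.com": "xyz"}
ALIASES = {"xyz.example.com": "xyz", "old-name.example.com": "xyz"}


def strip_port(host: str) -> str:
    """Drop any ':port' suffix and lowercase. Simplified: ignores IPv6 literals."""
    return host.partition(":")[0].lower()


def match_site(host: str) -> tuple[str, str]:
    """Return (match_type, site_id), trying canonical hostnames before aliases."""
    hostname = strip_port(host)
    if hostname in HOSTNAMES:
        return ("hostname", HOSTNAMES[hostname])
    if hostname in ALIASES:
        return ("alias", ALIASES[hostname])
    raise MissingSite(hostname)
```

Both `match_site("xyz.sites.example.com")` and `match_site("xyz.example.com:8000")` resolve to the same site; only the match type differs.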

So all problems solved, right? Not quite.

Preferred Domains

Now that we have multiple domain names mapped to the same site, we have an SEO problem. Ideally each site should have one and only one canonical name, so we designate one of our aliases as the preferred domain name and have a site middleware that redirects requests to the https version of that name.

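    import re

    from django.conf import settings
    from django.http import Http404, HttpResponseBadRequest, HttpResponsePermanentRedirect
    from django.http.request import split_domain_port
    from django.utils.deprecation import MiddlewareMixin
    from wagtail.models import Site

    # match_site_to_request, MissingHostException, and logger come from this project.


    class MultitenantSiteMiddleware(MiddlewareMixin):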

        def process_request(self, request):
            """
            Set request._wagtail_site to the Site object responsible for handling this request. Wagtail's version of this
            middleware only looks at the Sites' hostnames. Ours must also consider the Sites' lists of aliases.
            """
            try:
                # We store the Site in request._wagtail_site to avoid having to patch Wagtail's Site.find_for_request
                match_type, request._wagtail_site = match_site_to_request(request)
            except Site.DoesNotExist:
                # This will trigger if no Site matches the request. We raise a 404 so that the user gets a useful message.
                # We provide the default site as request._wagtail_site, though, just in case a template(tag) that gets
                # rendered on the 404 page expects Site.find_for_request() to actually return a Site (rather than None).
                request._wagtail_site = Site.objects.get(is_default_site=True)
                raise Http404()
            except MissingHostException:
                # If no hostname was specified, we return a 400 error. This only happens in tests.
                return HttpResponseBadRequest("No HTTP_HOST header detected. Site cannot be determined without one.")

            # Grab the site we just assigned, using the Wagtail method, to ensure the Wagtail method will work later.
            current_site = Site.find_for_request(request)

            # Determine how the user arrived here, so we can redirect them as needed.
            arrival_domain, arrival_port = split_domain_port(request.get_host())
            # If an empty port was returned from split_domain_port(), we know it's either 80 or 443.
            if not arrival_port:
                arrival_port = 80 if not request.is_secure() else 443

            # If a user visits any site via http://, and we can be 100% sure that an https-compatible version of that site
            # exists, redirect to it automatically.
            if not request.is_secure() and is_ssl_domain(arrival_domain, request):
                target_domain = get_public_domain_for_site(current_site)
                target_url = f'https://{target_domain}{request.get_full_path()}'
                logger.info(
                    'https.auto-redirect',
                    arrival_url=f'http://{arrival_domain}{request.get_full_path()}',
                    target_url=target_url
                )
                # Issue a permanent redirect, so that search engines know that the http:// URL isn't valid.
                return HttpResponsePermanentRedirect(target_url)

    def get_public_domain_for_site(site):
        """
        Returns the public-facing domain for this site
        """
        return site.settings.preferred_domain or site.hostname


    def is_ssl_domain(domain, request):
        """
        Returns True if the given domain name is guaranteed to match our SSL certs after being run through
        get_public_domain_for_site().
        """
        # We know the domain matches our SSL certs post-get_public_domain_for_site() if one of two things is true:

        # 1. The current site has a preferred_domain set. We know this implies SSL compatibility because the Site Settings
        # form prevents a non-SSL-compatible preferred_domain from being set.
        current_site = Site.find_for_request(request)
        if current_site.settings.preferred_domain:
            return True

        # 2. If the given domain matches any of our SSL wildcard domains in the way that SSL counts as a match,
        # e.g. *.example.com matches xyz.example.com but not www.xyz.example.com.
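        for wildcard_domain in settings.SSL_WILDCARD_DOMAINS:
            # Escape the domain so its dots match literally (assumes the setting holds plain domain names).
            if re.match(rf'^[^.]+\.{re.escape(wildcard_domain)}$', domain):
                return True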

        return False
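The wildcard rule in step 2 is easy to get subtly wrong, so here is a small standalone sketch of it. It assumes the setting holds bare domain names such as `sites.example.com`; the value below and the `matches_wildcard` helper are made up for illustration.

```python
import re

# Hypothetical value; assumes entries are bare domains with no leading "*.".
SSL_WILDCARD_DOMAINS = ["sites.example.com"]


def matches_wildcard(domain, wildcard_domains=SSL_WILDCARD_DOMAINS):
    """True if domain is exactly one label below one of the wildcard domains."""
    for base in wildcard_domains:
        # [^.]+ allows exactly one label; re.escape keeps the dots in base literal.
        if re.fullmatch(rf"[^.]+\.{re.escape(base)}", domain):
            return True
    return False
```

With this sketch, `xyz.sites.example.com` matches, while `www.xyz.sites.example.com` (two labels deep) and `sites.example.com` itself do not. The sketch uses `re.fullmatch` because `$` also matches just before a trailing newline.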

Relative URLs

So now all requests go to the preferred domain, right? Sadly, no. Despite all advice to use page and document choosers when creating links within a site, our content editors often copy and paste links from the browser’s address bar instead. That often leads to links with the xyz.sites.example.com domain name, particularly for sites that are built while an old site is still live. So we have code that lets us always store relative URLs for links within a site. When a page is saved, a new Revision is created. Before that revision is saved, we convert any links to “our” domains into relative links and store those instead.

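    import json
    import re

    from django.core.exceptions import ObjectDoesNotExist
    from django.core.serializers.json import DjangoJSONEncoder
    from django.db.models.signals import pre_save
    from django.dispatch import receiver
    from wagtail.models import Revision, Site

    # get_current_request and _decode_revision_datetimes are helpers defined elsewhere in this project.


    @receiver(pre_save, sender=Revision)
    def postprocess_links(sender, instance, raw, using, update_fields, **kwargs):
        """
        To ensure that copy-pasted URLs always point to the correct site, we remove the scheme and domain from any URLs
        which include the site's hostname or any of its aliases, converting them into relative URLs.
        """
        # Round-trip the revision content through JSON so every link can be rewritten with one regex pass.
        domains = get_domains_for_current_site()
        content_string = json.dumps(instance.content, cls=DjangoJSONEncoder)
        updated_string = domain_erase(domains, content_string)
        instance.content = json.loads(updated_string, object_hook=_decode_revision_datetimes)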


    def get_domains_for_current_site():
        """
        Returns the list of domains associated with the current site. If there is no current site, returns empty list.
        """
        request = get_current_request()
        site = Site.find_for_request(request)
        alias_domains = []
        if site:
            alias_domains.append(site.hostname)
            try:
                alias_domains.extend([alias.domain for alias in site.settings.aliases.all()])
            except ObjectDoesNotExist:
                # This is a generic "except" because core can't know which settings class's DoesNotExist might get thrown.
                pass
        return alias_domains


    def domain_erase(domains, text):
        """
        Removes all instances of the specified domains from the links in the given text.
        """
        # Do nothing if given an empty list of domains. Otherwise, we'll get mangled output.
        if not domains:
            return text

        # Create a regular expression from the domains, e.g. (https?://blah\.com/?|https?://www\.blue\.com/?).
        escaped_domains = [f"https?://{re.escape(domain)}/?" for domain in domains]
        regex = '(' + "|".join(escaped_domains) + ')'
        # Replace each match with a /. This regex will convert https://www.example.com/path to /path and
        # https://www.example.com into /. Since this is just about converting local URLs, that's the appropriate conversion
        # for path-less ones.
        replaced = re.sub(regex, '/', text)
        return replaced
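To see what this does to actual content, here is a small self-contained run. It repeats a condensed copy of `domain_erase` so it can be pasted into a shell on its own; the sample content and domain list are invented for the example.

```python
import json
import re


def domain_erase(domains, text):
    """Condensed copy of the function above, repeated so this example stands alone."""
    if not domains:
        return text
    pattern = "|".join(f"https?://{re.escape(domain)}/?" for domain in domains)
    return re.sub(pattern, "/", text)


# Hypothetical revision content with one copy-pasted link per domain plus an external link.
content = {
    "body": '<a href="https://xyz.sites.example.com/about/">About</a> '
            '<a href="http://xyz.example.com">Home</a> '
            '<a href="https://other.org/page">Elsewhere</a>'
}
domains = ["xyz.sites.example.com", "xyz.example.com"]

# Same round trip as the signal handler: dump to JSON, rewrite, load back.
cleaned = json.loads(domain_erase(domains, json.dumps(content)))
print(cleaned["body"])
# <a href="/about/">About</a> <a href="/">Home</a> <a href="https://other.org/page">Elsewhere</a>
```

One edge case worth knowing: the pattern does not check what follows the domain, so a link to an unrelated host that merely starts with one of our names (say `https://xyz.example.com.au/x` when `xyz.example.com` is in the list) would be rewritten to `/.au/x`. That has not been a practical problem for the kinds of names we use, but a lookahead for a URL delimiter would be one way to guard against it.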