Thursday, September 27, 2012

Django app reset (with south)

When developing a new Django app, it is common to make lots of changes in the models.py module.
In order to actually test the new app, you need to update the database with the new schema.
However, Django's syncdb does not update existing tables; it only adds missing ones.

There are a few database migration tools out there, but south is by far the most common one.
South excels on small changes, like adding a field or removing a constraint. In the early stages of app development, however, you might make rapid large changes and in the same time, not care too much about the existing data in the database (for that particular app).

So during development, you might want to do some kind of "app reset", meaning 'drop all the tables for this app and recreate them according to the new model definitions'.
As common as it seemed to me, I couldn't find a solution for that procedure in either Django's native API or the community.
The closest options are sqlclear, which prints the SQL statements to drop an app's tables, and flush, which resets the entire database. Obviously, neither of these is south-friendly.

But I wanted something that actually resets a single app and also plays nicely with south.

Enters "south_reset", a management command that is just a few south commands sewn together, but I still found it useful enough to share.

The usage is pretty straightforward: just list the app names you want to reset.
The optional "soft" flag means that the migrations are merged into a single initial migration without actually running them (they are just faked), so the app's data persists. This is useful when you have made lots of migrations and want to get rid of the clutter, but still keep the existing data in the database.
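For example (the app names here are made up, and the exact flag spelling follows the gist):

python manage.py south_reset myapp otherapp --soft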

Note that you should be very careful with this command if you are deploying the code somewhere; you might end up with ghost migrations.

So, without further ado, here is the gist:

If you are not familiar with management commands, you need to put this script under a "management/commands" folder in any of your apps. More information here.
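
For reference, here is a rough sketch of the idea, in case you just want to see the shape of it before opening the gist. This is not the gist verbatim; it assumes south is installed, and the option name and helper details are illustrative:

import os
import shutil
from optparse import make_option

from django.core.management import call_command
from django.core.management.base import BaseCommand
from django.db.models import get_app


class Command(BaseCommand):
    help = 'Reset the given apps: recreate their migrations and (unless --soft) their tables'
    option_list = BaseCommand.option_list + (
        make_option('--soft', action='store_true', dest='soft', default=False,
                    help='Merge migrations into a single initial migration without touching the data'),
    )

    def handle(self, *app_labels, **options):
        from south.models import MigrationHistory
        for app in app_labels:
            if not options['soft']:
                # Hard reset: unapply all migrations, which drops the app's tables
                call_command('migrate', app, 'zero')
            # Throw away the old migration files and south's record of them
            migrations_dir = os.path.join(os.path.dirname(get_app(app).__file__), 'migrations')
            if os.path.isdir(migrations_dir):
                shutil.rmtree(migrations_dir)
            MigrationHistory.objects.filter(app_name=app).delete()
            # Recreate a single initial migration from the current models
            call_command('schemamigration', app, initial=True)
            # Apply it for real, or just record it as applied (soft mode keeps the data)
            call_command('migrate', app, fake=options['soft'])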

Sunday, September 16, 2012

Django Cache Chaining

In a previous post, I used Django's FileBasedCache to synchronize the static files version with the code version. One of the advantages of this method was that there was no performance hit when a process was recycled, since the cache starts "full".
On the other hand, there was also no performance improvement over time: a file-system cache can work great on a local machine with an SSD, but on a cloud machine, where the storage might not be as close, it is significantly slower than local RAM, and maybe even slower than a memcached instance.

So essentially, I needed to cache my cache.

One cache to rule them all


Django supports multiple cache backends, so you can define both a local-memory cache backend and a file-based backend. What I wanted to create is a cache backend that chains the two together.

So here is the interface I wanted:
  • get - try getting the item from the first cache in the chain; if it exists, return it, otherwise go to the next cache.
    If there is a hit in a deeper cache backend, update all the caches in the chain up to it.
  • set - set the item in all the caches in the chain.
That turned out to be very simple to implement:


from django.core.cache import BaseCache
from django.core.cache import get_cache
from lock_factory import LockFactory

class ChainedCache(BaseCache):
    def __init__(self, name, params):
        BaseCache.__init__(self, params)
        self.caches = [get_cache(cache_name) for cache_name in params.get('CACHES', [])]
        self.debug = params.get('DEBUG', False)

    def add(self, key, value, timeout=None, version=None):
        """
        Set a value in the cache if the key does not already exist. If
        timeout is given, that timeout will be used for the key; otherwise
        the default cache timeout will be used.

        Returns True if the value was stored, False otherwise.
        """
        if self.has_key(key, version=version):
            return False
        self.set(key, value, timeout=timeout, version=version)
        return True

    def get(self, key, default=None, version=None):
        """
        Fetch a given key from the cache. If the key does not exist, return
        default, which itself defaults to None.
        """
        def recurse_get(cache_number = 0):
            if cache_number >= len(self.caches): return None
            cache = self.caches[cache_number]
            value = cache.get(key, version=version)
            if value is None:
                value = recurse_get(cache_number + 1)
                # Keep the value from the next cache in this cache for next time
                if value is not None: cache.set(key, value, version = version) # Got to use the default timeout...
            else:
                if self.debug: print 'CACHE HIT FOR', key, 'ON LEVEL', cache_number
            return value

        value = recurse_get()
        if value is None:
            if self.debug: print 'CACHE MISS FOR', key
            return default
        return value

    def set(self, key, value, timeout=None, version=None):
        """
        Set a value in the cache. If timeout is given, that timeout will be
        used for the key; otherwise the default cache timeout will be used.
        """
        # Just to be sure we don't get a race condition between different caches, lets use a lock here
        with LockFactory.get_lock(self.make_key(key, version = version)):
            for cache in self.caches:
                cache.set(key, value, timeout = timeout, version = version)

    def delete(self, key, version=None):
        """
        Delete a key from the cache, failing silently.
        """
        # Just to be sure we don't get a race condition between different caches, lets use a lock here
        with LockFactory.get_lock(self.make_key(key, version = version)):
            for cache in self.caches:
                cache.delete(key, version = version)

    def clear(self):
        """Remove *all* values from the cache at once."""
        for cache in reversed(self.caches):
            cache.clear()


# For backwards compatibility
class CacheClass(ChainedCache):
    pass

And here are the settings:
CACHES = {
    'staticfiles' : {
        'BACKEND' : 'chained_cache.ChainedCache',
        'CACHES' : ['staticfiles-mem', 'staticfiles-filesystem'],
        'DEBUG' : False,
    },
    'staticfiles-filesystem' : {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(PROJECT_ROOT, 'static_cache'),
        'TIMEOUT': 100 * 365 * 24 * 60 * 60, # A hundred years!
        'OPTIONS': {
            'MAX_ENTRIES': 100 * 1000
        }
    },
    'staticfiles-mem' : {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'staticfiles-mem'
    }
}
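
With these settings in place, the chained backend is used like any other Django cache. A quick sanity check from a Django shell could look like this (the key and value are just examples):

from django.core.cache import get_cache

cache = get_cache('staticfiles')
cache.set('some-key', 'some-value')  # Written to both the locmem and the file-based caches
print cache.get('some-key')          # Served from locmem; falls back to the file cache after a process restart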



You can also get the code in this gist.

A few notes:
  • I am using a named lock factory, which is also useful for other things; you can check it out in the gist (a minimal stand-in is sketched below).
    Django is not strict about thread safety in cache backends, so you can remove the lock altogether, but I prefer it this way.
  • Calling "get" might have the side effect of setting the item in the cache backends that missed. This can cause the item's timeout to be larger than originally requested, but no larger than the sum of the default timeouts of the cache backends in the chain.
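
If you want the ChainedCache snippet above to run without pulling the gist, a minimal stand-in for the named lock factory could be as simple as the following (a sketch only; the gist's version may differ):

import threading
from collections import defaultdict

class LockFactory(object):
    """Minimal named-lock factory sketch: hands out one lock per key."""
    _locks = defaultdict(threading.Lock)
    _guard = threading.Lock()

    @classmethod
    def get_lock(cls, name):
        # Guard the defaultdict so two threads don't race to create a lock for the same name
        with cls._guard:
            return cls._locks[name]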

Problem solved - let's go eat!

Sunday, September 9, 2012

Staticfiles on heroku with django pipeline and S3

Static files are always a nasty bit, even more so when you serve them from a completely different web server.
I recently had to do this for a Django project hosted on Heroku. Serving your static files from the web dyno is strongly discouraged, so I went with S3.

requirements

  1. Static files are served from S3
  2. Compile, minify, and combine the JS/CSS
  3. When working locally - serve the files from Django without changing them.
  4. Don't rely on browser cache expiration - manage the versions of the static files

Since the process in step 2 might take some time, I didn't want it to block the loading of the dynos, so I avoided calling collectstatic on the dyno.

Moreover, I wanted the version of the files to be perfectly synced with the server code, i.e. every version that is uploaded should have a corresponding static file package that is linked to it and is immutable in the same sense that a git commit is immutable.
This is not a common requirement, but it makes a lot of sense, since the great majority of the static files ARE code, and a mismatch between versions could cause unpredictable behavior.

solution


overview

  1. Use django-pipeline to define packages (while still getting the original files in the local environment)
  2. When deploying a new version, collect the files using Django's "collectstatic".
  3. Use Django's CachedFilesMixin for static files version management
  4. Upload the files to S3 with s3cmd
  5. Commit the hashed names of the static files to the code - this synchronizes the static files version with the code version

Defining packages

Using django-pipeline, you can define the different packages and also include files that require compilation (like LESS or CoffeeScript). This is done in the settings file, like so:

# CSS files that I want to package
PIPELINE_CSS = {
    'css_package': {
        'source_filenames': (
            r'file1.css',
            r'file2.less',
            ),
        'output_filename': 'package.css', # Must be in the root folder or we will have relative links problems
    },
}
PIPELINE_JS = {
    'js_package' : {
        'source_filenames': (
            r'file1.js',
            r'file2.coffee',
            ),
        'output_filename': 'package.js', # Must be in the root folder or we will have relative links problems
    }
}

PIPELINE_YUI_BINARY = ...
PIPELINE_COMPILERS = (
    'pipeline.compilers.coffee.CoffeeScriptCompiler',
    'pipeline.compilers.less.LessCompiler',
)

PIPELINE_COFFEE_SCRIPT_BINARY = 'coffee'
PIPELINE_LESS_BINARY = ...
# Storage for finding and compiling in local environment
PIPELINE_STORAGE = 'pipeline.storage.PipelineFinderStorage'
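
In templates, the packages are then referenced by name. With the django-pipeline of this era, that looks roughly like the following (the tag names changed in later releases, so check the docs for your version):

{% load compressed %}
{% compressed_css 'css_package' %}
{% compressed_js 'js_package' %}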

collecting files and adding version management

Collection is composed of a few steps:
  1. Find all the static files in all the apps this project is using (via INSTALLED_APPS)
  2. Copy all the files to the same root folder in the local environment
  3. Create packages according to the pipeline settings.
  4. Append the MD5 hash of each file to its name (so file.js is renamed to file.****.js)
  5. Go over CSS files that have imports and image references (like url()), and change the path to the new file name of that resource
This can be done by using a custom storage for the staticfiles app.

# Local location to keep static files before uploading them to S3
# This should be some temporary location and NOT committed to source control
STATIC_ROOT = ...
# Storage for collection, processing and serving in production
STATICFILES_STORAGE = 'myapp.storage.PipelineCachedStorage'

And the storage is simply:


from django.contrib.staticfiles.storage import CachedFilesMixin, StaticFilesStorage
from pipeline.storage import PipelineMixin

class PipelineCachedStorage(PipelineMixin, CachedFilesMixin, StaticFilesStorage):
    pass

So whenever we execute the collectstatic management command, we get all the steps that are described above.
One caveat you might encounter is that during step 5, if a resource is not found, an exception is raised and the process won't continue. For example, if one of the CSS files in one of the apps you are using (possibly a 3rd-party one) references a background image that does not exist, the collection process will fail when it reaches that file.
This is a bit too strict in my opinion, so I used a derived version of CachedFilesMixin that is more forgiving:

class MyCachedFilesMixin(CachedFilesMixin):
    def hashed_name(self, name, *a, **kw):
        try:
            return super(MyCachedFilesMixin, self).hashed_name(name, *a, **kw)
        except ValueError:
            print 'WARNING: Failed to find file %s. Cannot generate hashed name' % (name,)
            return name
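
The lenient mixin then simply takes the place of CachedFilesMixin in the storage class shown earlier (and the same swap works for the S3 storage defined later), e.g.:

class PipelineCachedStorage(PipelineMixin, MyCachedFilesMixin, StaticFilesStorage):
    pass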

Upload the files to S3

To upload the files, I use s3cmd, which is faster than anything else I have tried. You can actually set Django up to upload the files directly to S3 during collection, but it will be much slower and will result in more S3 activity than doing it this way.

You can sync the local folder with the S3 bucket like this:

s3cmd sync collected/ s3://mybucket -v -P

Notice that you can do this without harming the current version in production: static files that have changed will have a different file name, since we added the MD5 hash to their names.

To make Django create links to the files on S3, we use django-storages. We update the production settings with the AWS settings and use an S3BotoStorage with a corresponding STATIC_URL:


from boto.s3.connection import ProtocolIndependentOrdinaryCallingFormat

AWS_STORAGE_BUCKET_NAME = os.environ.get('AWS_STORAGE_BUCKET_NAME')
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
AWS_ENABLED = os.environ.get('AWS_ENABLED', 'True') == 'True' # Should only be True in production (env vars are strings)
AWS_S3_CALLING_FORMAT = ProtocolIndependentOrdinaryCallingFormat()
AWS_QUERYSTRING_AUTH = False

STATIC_URL = '//s3.amazonaws.com/%s/' % AWS_STORAGE_BUCKET_NAME if AWS_ENABLED else '/static/'
STATICFILES_STORAGE = 'myapp.storage.S3PipelineStorage' if AWS_ENABLED else 'myapp.storage.PipelineCachedStorage'

A few notes about these settings:
  • AWS_ENABLED should only be true in production, so we are not using S3 when working locally
  • AWS_S3_CALLING_FORMAT now defaults to the S3 subdomain bucket URL, which is great for CNAMEs, but Chrome does not like it when you download assets directly from *.s3.amazonaws.com and raises sporadic security errors, so I prefer to keep using the original URL scheme
  • AWS_QUERYSTRING_AUTH is disabled because there are currently too many bugs that make the signature wrong when you use S3BotoStorage and CachedFilesMixin together; hopefully, that will change soon
Also notice that I changed STATICFILES_STORAGE to 'myapp.storage.S3PipelineStorage' in production. This is the S3 equivalent of what we have in the local environment:

from storages.backends.s3boto import S3BotoStorage

class S3PipelineStorage(PipelineMixin, CachedFilesMixin, S3BotoStorage):
    pass

Linking the static files version to the code version

So now we have different versions of the static files residing side by side on S3 without interfering. The last issue is to make sure each code version is linked to the correct static files version. Since we don't want the resources themselves to be available on the web dyno, we need to keep a separate mapping between each file name and its versioned file name (with the hash).
One way to do this is with a filesystem-based cache. When files are collected, CachedFilesMixin uses a Django cache backend called 'staticfiles' (or the default one if that is not defined) to keep the file name mapping. By using a filesystem-based cache, we can keep this mapping after the collection and then commit it to the code, so it will be available to the web dyno when we push.
To add the filesystem-based cache:
 
CACHES = {
    ...,
    'staticfiles' : {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(PROJECT_ROOT, 'static_cache'),
        'TIMEOUT': 100 * 365 * 24 * 60 * 60, # A hundred years!
        'OPTIONS': {
            'MAX_ENTRIES': 100 * 1000
        }
    },
}

Notice the cache is kept inside the project directory so it will be picked up by git.
The deployment script now contains:

rm -rf static_cache
manage.py collectstatic --noinput
s3cmd sync collected/ s3://bucket -v -P
git add static_cache
git commit static_cache -m "updated static files cache directory"

The cache is deleted at the beginning of the process and afterwards committed to the git repository (we commit just the folder that contains the cache, regardless of the status of the rest of the repository).
Again, this does not change anything in production. To do the actual deployment, we just push to Heroku and immediately get all the code changes together with the static files changes.
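
As a quick sanity check that the committed mapping is actually picked up on the dyno, you can resolve a hashed URL from a Django shell without touching the files themselves (the file name here is just an example):

from django.contrib.staticfiles.storage import staticfiles_storage

# Should print something like '//s3.amazonaws.com/<bucket>/package.abcdef123456.css'
print staticfiles_storage.url('package.css')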

Problem solved - let's go eat!