Fun with Redis and Translation Tables

One of the first projects I took on at POPSUGAR was adding internationalization and localization (I18N/L10N) capabilities to the site. While the general approach and APIs for I18N/L10N are well known and widely available, there were still a couple of interesting challenges.

Since we forked the POPSUGAR CMS from an early version of Drupal, it was fairly trivial to grab the locale module from the latest version and backport it to the existing system. This added support for the GNU gettext import/export format, along with simple translation calls for text, plurals, counts, etc. The Drupal locale "source" schema used a simple storage model where the unique key was the original English text combined with an optional context string. The simplicity of that model means developers do not have to create or track IDs for every unique text string in the code. The downside is that rows in the locale source table are easily orphaned whenever a text string is changed, with no automated mechanism to later discover them. While the wasted data storage and table growth rate are minuscule compared to the rest of the system, there is still a need to export the complete source table for translation whenever a new language is added. Since human translators are used, it would be preferable not to have them waste time on unused strings.
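For a sense of what those calls look like, here is a minimal sketch assuming Drupal-7-style t() and format_plural() signatures; the exact helpers in our backported module may differ:

```php
<?php
// Assumes Drupal-7-style translation helpers from the backported module.
// The source table is keyed on the English text plus an optional context
// string, so no numeric IDs ever appear in the calling code.

// Simple text: the English string itself is the lookup key.
$label = t('Next Slide');

// Same English word, two meanings: the context disambiguates,
// producing two distinct rows in the source table.
$month = t('May', array(), array('context' => 'Long month name'));
$modal = t('May', array(), array('context' => 'Modal verb'));

// Plurals: both variants live in a single source row.
$count   = 3;
$caption = format_plural($count, '1 comment', '@count comments');
```

Change 'Next Slide' to 'Next slide' in code and a brand-new source row is created on the next request, while the old row (and its translations) lingers on, orphaned.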

The first obvious solution is to include a "last accessed" column in the locale source table and update it whenever a text string is used. Unfortunately, that scales badly when a web server is generating millions of pages per day/hour/minute, all using the same strings. Avoiding SQL database access (writes even more than reads) is paramount to page generation performance, and the entire translation table is kept in memcache to avoid having to do any queries at all. Furthermore, some strings like "Next Slide" are used far more frequently than others, potentially requiring hundreds of updates to the same row while servicing a single web request.
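For contrast, the rejected approach would look something like this (hypothetical function, table, and column names); every rendered string turns into a synchronous MySQL write:

```php
<?php
// The naive "last accessed" approach, sketched only to show the cost:
// each translation call issues an UPDATE, so a gallery page that renders
// "Next Slide" dozens of times hammers the same row dozens of times.
function t_tracked(PDO $pdo, int $sourceId, string $translated): string
{
    // Hypothetical column; one write per use, per string, per request.
    $pdo->prepare('UPDATE locales_source SET last_accessed = NOW() WHERE lid = ?')
        ->execute([$sourceId]);
    return $translated;
}
```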

My next thought was to store the last access time in a NoSQL database and process the results in batch on a regular basis. We already had a generic Redis instance available for miscellaneous tasks, and this turned out to be quite easy to implement. But once again, even sending messages to a Redis server takes time; I did not want to take that hit during page generation, nor send multiple messages for the same string from a single page. It occurred to me that a simple solution to this second problem was PHP's register_shutdown_function(), which lets you specify a function to be called once the current PHP request has completed but before the process "exits", similar to a C++ object destructor. During page generation, a simple key/value array is populated, keyed by the source table index and pointing to the calling function and line number. This naturally becomes a unique list of table row pointers, which is then sent to the Redis server at shutdown via the Predis hMSet() method, which updates or inserts each entry in the array as a Redis key/value, thus de-duplicating values across page requests. This also allowed us to leverage Redis key expiration, such that deprecated source strings simply fall out of the dataset.
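A minimal sketch of that shutdown-hook pattern follows. The class, key names, and TTL are assumptions, not the original implementation; and while the post used hMSet, this sketch writes one Redis key per source row with SETEX so that the key expiration described above applies per string:

```php
<?php
// Sketch of the shutdown-hook tracker; names and TTL are assumptions.
require 'vendor/autoload.php'; // Predis installed via Composer

class LocaleUsageTracker
{
    /** @var array<int,string> source row id => "function:line" of the caller */
    private static $used = array();
    private static $registered = false;

    // Called from inside the translation function for every lookup.
    public static function record($sourceId)
    {
        // Re-writing the same array key naturally de-duplicates strings
        // like "Next Slide" that are used many times in one request.
        $bt = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 2);
        $line = isset($bt[0]['line']) ? $bt[0]['line'] : 0;
        $func = isset($bt[1]['function']) ? $bt[1]['function'] : 'main';
        self::$used[$sourceId] = $func . ':' . $line;

        if (!self::$registered) {
            // Runs after the response is complete, before the process
            // exits, so the Redis round-trip never delays page generation.
            register_shutdown_function(array(__CLASS__, 'flush'));
            self::$registered = true;
        }
    }

    public static function flush()
    {
        if (!self::$used) {
            return;
        }
        $redis = new Predis\Client(); // assumes a reachable Redis instance
        $redis->pipeline(function ($pipe) {
            foreach (self::$used as $id => $caller) {
                // The TTL does the orphan detection: a key that is never
                // refreshed simply expires out of the dataset.
                $pipe->setex('locale:src:' . $id, 14 * 86400, $caller);
            }
        });
    }
}
```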

The next step was to build a batch script that collected all the unique keys from Redis every two weeks and then filtered the SQL table down to rows that weren't found in Redis. Those strings can be safely removed from the locale source table, since they aren't actually being used by the web server. With roughly 60MM unique visitors per month, we were fairly confident that any string still in use would show up in Redis, and the cost of a false negative is extremely small since the Drupal module will just recreate the row in the table. Furthermore, by including the calling function and line number with each translation, we are able to verify that a string has actually changed, as well as find simple programming errors where someone attempted to translate editorial content rather than user-interface text.
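A sketch of what that batch job might look like, using hypothetical connection details, the key prefix from the tracker sketch above, and the stock Drupal locales_source table and lid column; treat it as illustrative rather than the production script:

```php
<?php
// Biweekly cleanup sketch: anything in locales_source with no matching
// Redis key has not been used within the TTL window.
require 'vendor/autoload.php';

$redis = new Predis\Client();
$pdo   = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass');

// SCAN instead of KEYS so a large keyspace doesn't block Redis.
$live   = array();
$cursor = 0;
do {
    list($cursor, $keys) = $redis->scan(
        $cursor,
        array('MATCH' => 'locale:src:*', 'COUNT' => 1000)
    );
    foreach ($keys as $key) {
        $live[(int) substr($key, strlen('locale:src:'))] = true;
    }
} while ($cursor != 0);

// Delete rows never seen in Redis. A false negative is cheap: Drupal
// recreates the row the next time the string is actually translated.
$delete  = $pdo->prepare('DELETE FROM locales_source WHERE lid = ?');
$removed = 0;
foreach ($pdo->query('SELECT lid FROM locales_source') as $row) {
    if (!isset($live[(int) $row['lid']])) {
        $delete->execute(array($row['lid']));
        $removed++;
    }
}
echo "Removed $removed orphaned source strings\n";
```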

There are certainly more elegant solutions that could have been developed, perhaps moving the entire localization table into a NoSQL database and leveraging key expiration directly. That, of course, would require rewriting all the existing Drupal admin and API interfaces to work with the new data storage mechanism. Like many other media companies, we have limited resources, and when applying them to a problem we need to ensure the ROI is greater than the effort applied. A complete rewrite of the Drupal localization module wouldn't be justified for our use case.

This post originally appeared at https://engineering.popsugar.com/Using-Redis-MYSQL-Orphan-Management-44960001