09.11.11

RubyInline Tricks and Tips

Posted in Linux, Ruby at 3:31 pm by mike

Permutative Schedule Calculation

One of the most compute intensive tasks in the RSportz system is the scheduling code. It turns out that scheduling pool and cross pool games between a number of teams is actually a rather challenging computer science problem. Add in seeding, home/away venue requirements, and calendar limitations and you have yourself a pretty hairy set of variables to deal with. You end up creating a fairly classic permutation problem. So much so that the algorithm RSportz uses was derived from this paper: U. Schoning. A probabilistic algorithm for k -SAT and constraint satisfaction problems. Proc. 40th FOCS, 1999 (Still looking for a copy of this BTW)

The code was originally written in Java and then directly translated to Ruby by someone who didn’t really know either language that well. The first red flag I found were the hand coded bubble sorts, which were copied directly from the Java version. 🙁

But the core algorithm was pretty straight forward. Create an NxN matrix where N represents the number of teams. Rows are (arbitrarily) home games and columns are away games. Walk through the matrix and block out all the games that can’t be played due to seeding or other restrictions, then start filling the matrix with games, insuring that the number of home and away games for each team are balanced. Once you have a complete matrix for the number of games each team should play, score that matrix according to the mix of cross-group play to insure a much randomization as possible, while still avoid top seed match ups when possible.

The real world comparison here is the NFL, where each team must play every team in their division twice, and then some mix of cross-division and cross-conference play. While I suspect the NFL primarily uses TV market size data to drive their out of division play, our algorithm wants to make sure every team plays as many different divisions as possible in it’s cross group play.

So we score each complete matrix, and then move on to another combination, until we either reach the maximum possible score, or exhaust all combinations of the matrix or hit a time limit.   Hopefully we have at least one matrix before hitting the time limit. Now those of you familiar with combinatorial math know this numbers get very large very quickly. For instance, for 20 teams playing 5 games each, the number of combinations is “400 choose 50” or 1.70359003 × 1064 (a really big number). Now we do eliminate some games since we know each team can’t play itself, and there may be seed, home and away game restrictions, but as you can see, it can be a time consuming problem to solve.

So when I first started playing with the system, I was a little shocked to find out that to schedule 5 cross-group games between 4 groups of 10 teams took 3 minutes (basically, it was hitting the built in timeout).     I first walked through the code replacing all the bubble sorts with native ruby array sorting calls, and attempting to reduce unnecessary object copying where possible.     Finally, I realized that the system was finding the best possible score for this scenario in 30 seconds, but was continuing to process until the time limit was exhausted.     So the first thing I did was calculate the highest possible score for the matrix, and then made the algorithm stop when it was hit.  (We still need to go back and consider the speed a perfect score vs. 60-80%.)   That got me down to 30 seconds for the 40 team scenario, but it still seemed way too slow for me, at which point I started investigating how to call out to C code in Ruby.

As an extra motivator, when this problem was described to a friend of mine, he felt compelled to tweet “Re-implementing your O(n!) algorithm in C (from Ruby) probably isn’t going to help”.   I have to thank him for the inspiration.

RubyInline

I cannot say enough good things about RubyInline.  It makes dropping into C code simply trivial since you just place your C code in the middle of your Ruby code.   No Makefiles or compiler options to worry about.    You simply include the gem

require "inline"

And then add your code as a lamda section within a inline do block:

  inline do |builder|
    builder.c '
    static VALUE assign_c_wrapper(VALUE total_teams, VALUE matrix, VALUE team_games, ....) {
       ...
       return ret ? Qtrue : Qfalse;
    }'
  end

The argument to builder.c needs to be a string, so you do need to be careful with your single and double quotes and make sure they are escaped properly if needed in the code. Obviously, anything that returned a string would work here, so the C code could be in it’s own file at this point and returned by an IO method, but that undermines the whole point. A search for RubyInline will show you all sorts of examples.

Although RubyInline will do automatic argument conversion for you with simple data-types, I was dealing with 2D arrays so I chose not to take that route and converted them myself, hence the reason all my arguments are VALUEs, which are essentially just pointers to Ruby objects.

Now the folks at ZenSpider have taken this one step further and have another gem called RubyToC, which simply translates your Ruby code directly to C and then compiles it via RubyInline. Quite clever, but once again, only the simplest data types are supported.

A couple bookmarks that are handy to have while your writing the code: One is the Extending Ruby chapter from the Programming Ruby guide. This has a nice table which lays out the type conversions. The other one is the Yukihiro’s original README.ext which is in the Ruby source distribution, but is also kept online and nicely formatted in HTML by eqqon.com

So the code to convert a 2 dimensional array from Ruby to C ends up looking like this:

      // matrix is the Ruby NxN integer array.  Created with the code
      //  matrix=Array.new(total_teams) { Array.new(total_teams, 0) }

       char c_matrix[c_total_teams][c_total_teams];
       VALUE *matrix_row = RARRAY_PTR(matrix);
        for (i = 0; i < c_total_teams; i++) {
           VALUE *matrix_col = RARRAY_PTR(matrix_row[i]);
           for (j = 0; j < c_total_teams; j++) {
             c_matrix[i][j] = FIX2INT(matrix_col[j]);
           }
         }

The inline builder block also takes some useful arguments. Compiler flags when needed as well as a “prefix” section which basically get treated as a C include file for the block of code. Being an old C pre-processor hack, I found this very handy. More on that later.

inline(:C) do |builder|
       builder.add_compile_flags '-ggdb'
       builder.prefix '
// We need to play games in order to pass around 2D arrays of variable size
// Luckily, we know the size of the second dimension
#define MATRIX(_i, _j) matrix[((_i) * total_teams) + (_j)]

#define TPG_FACTOR 1.2
#define GPG_FACTOR 1.1
#define GPT_FACTOR 1
...

The one gotcha was the line numbering in gdb was off. RubyInline tries to add a #line directive to the C code it generates in your home directory: ~/.ruby_inline, but they are off. I tried to adjust them in the prefix section, but didn’t have much luck, but I was able to use gdb to step through my code after adjusting the line numbers manually.

16-156x Performance Improvement

So most users of Ruby Inline report a 10x performance improvements, but even given the factorial nature of my algorithm, I strongly suspected we could do even better. As I mentioned before, the code was originally written in Java, and I actually don’t expect you would see a significant performance increase between Java and C, but the algorithm was a worst case scenario for a duck typing language with no native datatypes like Ruby. For instance, take the following lines directly from the exiting ruby code:

    matrix=Array.new(total_teams) { Array.new(total_teams, 0) }

Seems simple enough, all I want is an NxN array of integers, which each value initialized to 0. But in Ruby, just because they are integers a this moment in time, doesn’t mean someone can’t insert an ActiveRecord object into the middle of this matrix sometime in the future. So in this case we end up asking the memory management system for N arrays and N*N integer objects and initialize each one to 0. Now for the same code in C:

    char matrix[total_teams][total_teams];
    bzero(matrix, sizeof(matrix));

Optimized by the compiler, literally three instructions to multiply the values, bump the stack pointer, and then zero the memory range. And my Intel ASM knowledge is 15yrs old. I wouldn’t be surprised to find there’s now one instruction to do all of the above.

Of course, we don’t allocate the matrix O(n!) times, but we do execute the following harmless looking operation:

matrix[row][col] += 1

Once again, quite a bit of work in a duck type language where the interpreter has no idea what type of object is at that position in the matrix, while in C

matrix[row][col]++

Is once again optimized down to 1 instruction.

By the Numbers

So my test case (spec of course) was pretty simple.    It schedules games for a 4 group round robin league with 10 teams in each group.     Each group played 6 games within it’s own group and 4 cross group games.   Teams were seeded such that the top 20 seeded teams cannot play each other until the playoff start.    The algorithm is recursive by nature, so I counted the number of recursions and the number of times we scored a valid matrix and sent that to the debug log each time a better scoring matrix was found. In this case, I let the code run until the timeout was hit, so we could better compare the speed of the code.

# Games #Teams Recursions Maxtrices Tested Time
Ruby 30 10 662,063 2,521 16s
C 30 10 662,934 3,392 1s
Ruby 80 40 5,313,768 22,495 180s
C 80 40 138,884,345 3,508,954 180s

Note that in the 80 game case, we still hit the time limit before we reached the best possible score for the matrix, while in the case of the smaller ones, we found the base matrix prior to hitting the time limit. We do have some knowledge of which teams are going to be the most difficult to schedule, hence we can order the matrix to try those teams first, so the longer the algorithm runs, the less likely it is we’ll find a better scoring matrix. Since I was only measuring to second resolution, the 30 game case clearly showed a 16x performance improvement. But for the 80 game case, the C code was able to test 156 times more matrices than the Ruby code. Have to say I’m pretty happy with the results, though I’ve since done some further optimizations on our scoring algorithm, and deciding when a game matrix is “good enough”.

I suspect my theory-favoring friend will argue that a sample set of 1 doesn’t disprove his assertion, but my recommendation for optimizing code has been repeated by many others: profile, profile, profile.

Misc Tricks

The one thing that irritated me while developing the C is I found myself wanting a DEBUG wrapper which I’ve written many time before but didn’t have on hand. Since I had to recreate the macro from scratch, I figured I’d include the version here for myself and anyone else who needs it. In this case, I wanted to call the debug method on the Ruby logger from within C code, and make it easy enough to insert debug messages in the C code. I’ve been accused of abusing the C pre-processor in the past, so your mileage may vary:

#define DEBUG_LOG(_format, ...) \
    do { \
      VALUE _logger = rb_iv_get(self, "@logger"); \
      ID _debug = rb_intern("debug"); \
      char _message[2048]; \
      snprintf(_message, sizeof(_message), _format, ##__VA_ARGS__); \
      _message[sizeof(_message) - 1] = 0; \
      rb_funcall(_logger, _debug, 1, rb_str_new2(_message));\
    } while(0)

The “do {} while(0)” is an old trick to avoid having to worry about nested blocks or semi-colons. The use of the code is as follows:

      DEBUG_LOG("SCHED: assign(%ld) evaluate score increased from %f to %f after %ld"
                " attempts! teams: %d, games: %d, groups %d, %ld seconds elapsed",
                GET_MEMBER_LONG("@assign_calls"), best_score,  score, GET_MEMBER_LONG("@evaluation_attempts"),
                total_teams, games_scheduled, ngroups, time(0) - start_time);

Finally, I also found the following macros handy (all this was included in my builder.prefix section):

#define GET_MEMBER_LONG(_name) FIX2LONG(rb_iv_get(self, _name))
#define GET_MEMBER_DBL(_name) NUM2DBL(rb_iv_get(self, _name))
#define INCR_MEMBER(_name)  rb_iv_set(self, _name, LONG2FIX(GET_MEMBER_LONG(_name) + 1))

Finally, if you ever forget the syntax for passing a multidimensional array in C, it looks something like this:

     static int verify(char team_games[][2], int row, int col, int ngames, int max_games) {

You need to declare N-1 of the dimensions, or pass them in as variables and then do the indexing yourself.