Tests I don't like: Validations & Associations

The more time I spend writing tests, the more I learn and the more opinions I develop about how to test certain things. This post is the first of what may be several attempts to improve on common ways people test certain scenarios. These aren't meant to be revelations in design or anything, more so just reminders to avoid some clunky tests here and there. So, to start:

Validations & Associations

This is the type of test I really don't like:

it "should require username" do
  user = User.new(all_attrs_but_username)
  assert user.invalid?
  assert user.errors[:username].any?
  user.username = "foo"
  assert user.valid?
end

This can vary a bit, but the gist is there. The "check without the attr, set the attr, then check again" approach feels extremely clunky to me, and it takes a lot of test code to cover what is very little production code. Would you want one of these tests for every attribute on every model? I don't. As an alternative, just use matchers:

it { should validate_presence_of(:username) }

The one thing this obviously implies is that you should be writing your tests in a framework that supports this, or you should include a gem to help you. The example above uses shoulda:

https://github.com/thoughtbot/shoulda

Again, this feels obvious, but people still seem to default to plain old Test::Unit-style tests, as if something were stopping them from making their tests cleaner. If you aren't using a syntax like RSpec or Shoulda, I think you're making your life harder than it needs to be.
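For anyone curious what a one-line matcher is saving you, here is a rough sketch in plain Ruby of the idea behind a presence matcher. It is not shoulda's implementation: `User`, `RequirePresenceOf`, and the hand-rolled `valid?` and `errors` are hypothetical stand-ins so the example runs without Rails.

```ruby
# Hypothetical stand-in for a Rails model, so this runs without Rails.
class User
  attr_accessor :username, :email
  attr_reader :errors

  def initialize(attrs = {})
    @errors = {}
    attrs.each { |name, value| public_send("#{name}=", value) }
  end

  # Hand-rolled equivalent of a presence validation on username.
  def valid?
    @errors = {}
    @errors[:username] = ["can't be blank"] if username.to_s.strip.empty?
    @errors.empty?
  end
end

# Hypothetical matcher: blank out the attribute, validate, look for an error.
class RequirePresenceOf
  def initialize(attribute)
    @attribute = attribute
  end

  def matches?(record)
    record.public_send("#{@attribute}=", nil)
    record.valid?
    record.errors.fetch(@attribute, []).any?
  end
end

user = User.new(username: "foo", email: "foo@example.com")
puts RequirePresenceOf.new(:username).matches?(user) # true: username is required
puts RequirePresenceOf.new(:email).matches?(user)    # false: email is not validated
```

The "blank it, validate, check the errors" dance still happens, but it lives in one reusable object instead of being retyped in every test.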

Once you start using this, you can use the same simple syntax to check slightly more complex validation behavior, and also associations:

it { should_not allow_value("a" * 3).for(:title) }
it { should allow_value("a" * 4).for(:title) } # length tests
it { should have_many(:comments).dependent(:destroy) }

In my opinion, if you're putting too much work into these types of tests (like in the first example), you're starting to test Rails. If my model has the right validation or association parameters supplied, I trust Rails to handle the rest. And if you don't trust Rails, you most likely have some tests at a higher level that actually make use of these objects, which should provide additional validation that things are set up correctly at the lowest level.

A New Language And A New Domain

I've decided to start learning Objective-C. While that seems simple enough, I wanted to share a bit of the reasoning....

After working with Ruby/Rails for a while, I have found myself wanting to start learning something totally new. For some time though, I really haven't been sure what that new thing should be. All I know is I want it to be a new programming language. Furthermore, I don't just want it to be a weekend interest. I want to open a new avenue in my experience of building software. Don't get me wrong, I still want to be writing Ruby code every day for now, but I also want to gain some legitimate experience with something new.

There's no shortage of cool languages out there that I'm curious about. Hell, I could probably find a reason to be curious about any language. For me though, that ends up feeling like part of the struggle and goes against my goal of maintaining interest in my selection. Without a good reason to pick any one of them, I'm not likely to stick with my choice for any extended period of time while I keep Ruby as my primary language. But I think I started to realize why I never seem to want to let myself get too far from Ruby. With that environment, I'm capable of building the only type of software I have extensive professional experience building: web applications. It's what I know and it's what I'm constantly thinking about.

Having become proficient with Ruby (and the same can be said for any language/tool), I can now more easily explore the finer details of building software with it, and I can start to develop opinions on these matters. This isn't meant to sound like a revelation; it's pretty obvious. It is, however, important as a point of justification and clarity for what I noticed in my underlying behavior as I began forming such opinions on my own:

I want to continue building web applications with Ruby in the immediate future because I value the opinions and experience that come from having extensive knowledge of one language and its toolset.

Given this, it's no wonder I'm not spending my weekends learning Node.js, Python, Clojure, Scala, etc. When I think of these languages and the type of software I would build with them, I think of web applications. But I already stated that I want to continue building web applications with Ruby. Now, that's not to say that you can't write code with these languages that runs outside the web. But being honest about what I would try to build with them leads me back to the web. Hence, it feels like the key to consistent motivation in learning a new language is using that language to work into a new domain of software. In this case, that domain is mobile development.

With mobile development, I can start learning about building a new type of software and that software will be things that, at least in the programmer's pipe-dream sense, people can really use! I won't be hacking on little scripts, I won't be re-writing the same exercises I did with Rails 3 years ago, and I can learn a language that is totally new to me with a bit of a different flavor from Ruby (mmmm static-ish typing) all while building a totally new type of valuable software. Given that Java does not fit the "totally new to me" bill, Objective-C feels like an obvious choice.

Well, I'm glad I got that sorted out....

Disclaimer: I do not want this to sound like I do not value the opinions that come from a more breadth-first approach to learning languages. Rather, for me in the immediate future, I find more value in obtaining my goals as a developer in a more depth-first oriented approach. More than anything it seems like it could simply be chalked up to a matter of opinion, but I felt it was worth mentioning....

O(1) -> O(n) = You're Screwed

As a software developer, you never want to create bugs in your software. That said, when bugs creep up, they can be a lot of fun to figure out. There has always been a part of me that enjoys figuring out tricky bugs and recently I came across one of the strangest I’ve ever had to deal with.

It all started when one application in a larger system started experiencing serious performance problems. I’m talking application-coming-to-a-complete-halt kind of performance problems. Requests were hanging and users were getting errors as every call to a key back-end service timed out. Looking into the logs showed that one user seemed to be the root cause of the issue, but we couldn’t fathom why their data set would cause this kind of issue. They had a good amount of data, but nothing outrageous, and we had done tons of testing with much larger data sets without any sort of performance problem. We did not have direct access to the production data, so at first we were stumped. Was there some issue with the structure of their data set? Was the data corrupt in some odd way? Was there an obscure bug in the back-end service somewhere?

We knew the general area where the time was being spent. It was when the service was putting all of the user’s data into a hash table it maintains in memory. We looked at the code, and it seemed pretty straightforward. There was really only one spot that had any chance of slowing down: the collision resolution in the table when more than one item hashes to the same location. Sure, collisions can happen, but not to the degree we are talking about here. For this sort of problem, we would need to have almost everything in the user’s data set hashing to the same location. Crazy, right? I mean, that should never happen!

Well, it happened…

We were finally able to recreate the user’s data set locally (size-wise, not data-wise) and saw what we couldn’t believe: every single entry was hashing to the same spot in the table. All of them. Remember, this is a hash table. This sort of problem means we might as well be using a linked list to store the data, since collisions were handled by simply maintaining a list of all entries at a particular location. Our insertion/retrieval times were jumping from O(1) to O(n), and that jump for this user was killing the application. It’s the picture-perfect worst-case scenario for the data structure. Obviously, this is the fault of the hashing algorithm being used. However, the weird thing was that the algorithm was based on what seemed to be a fairly well-known one: Dan Bernstein’s hashing algorithm. There are a few variations of the algorithm, and from the bit of searching around I’ve done, it seems to be known that this algorithm can slip into a “degenerate” situation. So, this probably wasn’t the algorithm to use, but what the hell happened?
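To get a feel for how bad that worst case is, here is a minimal sketch of a chained hash table that counts key comparisons during lookups. It is not the service's code: `ChainedTable`, the table size of 1,031, and the two toy hash functions are made up for illustration.

```ruby
# Minimal chained hash table that counts key comparisons during lookups.
class ChainedTable
  attr_reader :comparisons

  def initialize(size, &hash_fn)
    @buckets = Array.new(size) { [] }
    @hash_fn = hash_fn
    @comparisons = 0
  end

  # Appends to the bucket's list without checking for duplicates.
  def insert(key, value)
    bucket_for(key) << [key, value]
  end

  # Walks the bucket's list until the key is found.
  def fetch(key)
    bucket_for(key).each do |stored_key, value|
      @comparisons += 1
      return value if stored_key == key
    end
    nil
  end

  private

  def bucket_for(key)
    @buckets[@hash_fn.call(key) % @buckets.size]
  end
end

keys = (1..1_000).to_a

spread     = ChainedTable.new(1_031) { |key| key } # every key gets its own bucket
degenerate = ChainedTable.new(1_031) { |_key| 0 }  # every key lands in bucket 0

[spread, degenerate].each do |table|
  keys.each { |key| table.insert(key, key.to_s) }
  keys.each { |key| table.fetch(key) }
end

puts "spread:     #{spread.comparisons} comparisons"     # 1000
puts "degenerate: #{degenerate.comparisons} comparisons" # 500500
```

Looking up 1,000 keys costs 1,000 comparisons when they are spread out, and 1 + 2 + ... + 1,000 = 500,500 when they all share one bucket. Scale that to six figures of entries and the table grinds to a halt.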

Well, remember when I said we had done tests on data sets bigger than this user’s set? Why didn’t that show the problem? It’s because we needed a data set of exactly the size this user happened to create. Allow me to give the nitty-gritty details:

- The user had a data set which had 123,577 entries of a particular type.
- The service in question allocated a hash table of size 131,769 (the user’s data size plus a bit extra).
- It then began hashing all the user’s data, using the database id of the fields as the key to the hash table.
- Given a value back from the hash, it then took that value modulo 131,769.
- The resulting value, which is supposed to be that entry’s location in the hash table, was 0. For every entry.

To put it simply: every hash value coming out of the algorithm had 131,769 as a factor. So every value hashed to location ‘0’. I don’t want to make bold statements about the root cause or nature of this behavior yet (another post for that), but our initial experiments showed that this was always the case for any integer value going into the algorithm which fit within 32 bits but was allocated in a 64-bit data type. We figure it has to do with the bit shifting used in the algorithm and the fact that it never shifted certain bits off the end (the extra 32 bits gave enough room). But again, more on that to come I hope!
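One way to reproduce the symptom, offered as a sketch rather than a diagnosis: assume the 64-bit key is hashed one byte at a time, low byte first, with a 64-bit accumulator. None of that is confirmed by the story above; `djb2_u64`, the byte order, and the starting value 5381 (the usual one for this hash) are all assumptions. Under them, an id that fits in 32 bits ends with four zero bytes, so the last four steps just multiply by 33, and the result is a multiple of 33^4 = 1,185,921 = 9 × 131,769.

```ruby
MASK64     = 0xFFFF_FFFF_FFFF_FFFF
TABLE_SIZE = 131_769

# Bernstein "times 33 with addition" hash over the 8 bytes of a 64-bit key.
# Assumptions: bytes are fed low byte first, the accumulator is 64 bits wide,
# and the starting value is 5381.
def djb2_u64(key)
  bytes = [key].pack("Q<").bytes # 8 bytes, little-endian
  bytes.reduce(5381) { |hash, byte| (((hash << 5) + hash) + byte) & MASK64 }
end

ids   = [1, 42, 123_577, 2**32 - 1] # all fit within 32 bits
slots = ids.map { |id| djb2_u64(id) % TABLE_SIZE }

puts slots.inspect                     # [0, 0, 0, 0]
puts 33**4 == 9 * TABLE_SIZE           # true
puts djb2_u64(2**32) % TABLE_SIZE != 0 # true: a wider id escapes slot 0
```

If the accumulator were only 32 bits wide, the multiplications would wrap around and this divisibility would generally be lost, which would be consistent with the observation that the extra 32 bits mattered.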

As an aside, since the algorithm exhibited this behavior for 131,769, the same behavior holds for any factor of 131,769 as well (not that this should be surprising). This was a good thing to remember, as it highlighted other values which could cause far worse performance than expected. I hope to get into more details of this situation and algorithm in the near future, but for now I just wanted to tell the story and point out the potential for chaos when using Bernstein’s “times 33 with addition” hash!
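To make the aside concrete, this small sketch lists every factor of 131,769. It is plain arithmetic on the number from the story and assumes nothing about the service itself; the variable names are made up for illustration.

```ruby
table_size = 131_769

# Brute-force list of every divisor of the table size.
divisors = (1..table_size).select { |d| (table_size % d).zero? }

puts divisors.inspect
# [1, 3, 9, 11, 33, 99, 121, 363, 1089, 1331, 3993, 11979, 14641, 43923, 131769]
puts table_size == 3**2 * 11**4 # true
```

Every one of the 15 values is built only from 3s and 11s, the same primes that make up 33.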

A New Blog

So, I decided I wanted to have more control over my blog. Having a place where I could try out new things and more easily update my blog with new features, styles, etc. seemed better than having some WordPress theme I didn't design, hosted in an environment I can't easily update and don't really have any good control over.

Obviously this is nothing fancy, but it should get the job done for now.
