Peter Cooper : UK Web 2.0 and Ruby on Rails consultant
Recent Posts
»Jay-Z: From Brooklyn to the Boardroom
»Prank Caller Submits Girl To Sexual Torture By Proxy
>Full archive
Other Posts
« Who's in your league?Snippets code repository launches »

Matching on data intersection across a many-to-many join with Ruby on Rails


Merging and sorting data across many-to-many joins in database-land has always been a bugbear of mine. SQL is not the ideal language for the job, although with a lot of coaxing you can get it to deliver the goods.

Yesterday I started work on a "code snippets" site. I have a lot of code snippets from my various development work on Perl, Ruby, and Rails, and want a central repository for them. It'd also be nice if people could add and manage their own too. They'd be totally taggable too, so if I wanted to see all 'perl' code, I can do that. If I want to see all 'ruby' code that's an 'algorithm' for a 'sort', then bang, I have them there. Even better, I can look at just MY 'ruby' snippets, and so on.

Almost four hours ago now, I had all this working. The problem, however, was that I was using pure Ruby code to find all posts which had certain tags associated with them (here, 'tags' is an array of tag names):


eval "@posts = @tag.posts.select{ |p| #{tags.collect{ |tagname| " p.tags.detect { |t| t.name == '#{tagname}' } " }.join(" && ")} }"

I HAD to find an SQL way of doing it, otherwise when the number of snippets reached into the thousands, I'd have Ruby sitting there crunching a ton of data! That's a database's job, so I hunted for SQL solutions. It seemed using a join per tag was the only way. Someone (nick removed at their request) from #rubyonrails came up with this method (solely for testing purposes, this was clearly a first draft!):

SELECT posts.post_title, tags.tag, tags_1.tag, tags_2.tag, tags_3.tag, tags_4.tag, tags_5.tag, tags_6.tag, tags_7.tag, tags_8.tag, tags_9.tag FROM ((((((((((((((((((tags INNER JOIN (posts INNER JOIN posts_tags ON posts.id = posts_tags.post_id) ON tags.id = posts_tags.tag_id) INNER JOIN posts_tags AS posts_tags_1 ON posts_tags.post_id = posts_tags_1.post_id) INNER JOIN tags AS tags_1 ON posts_tags_1.tag_id = tags_1.id) INNER JOIN posts_tags AS posts_tags_2 ON posts_tags_1.post_id = posts_tags_2.post_id) INNER JOIN posts_tags AS posts_tags_3 ON posts_tags_2.post_id = posts_tags_3.post_id) INNER JOIN tags AS tags_2 ON posts_tags_2.tag_id = tags_2.id) INNER JOIN tags AS tags_3 ON posts_tags_3.tag_id = tags_3.id) INNER JOIN posts_tags AS posts_tags_4 ON posts_tags_3.post_id = posts_tags_4.post_id) INNER JOIN posts_tags AS posts_tags_5 ON posts_tags_4.post_id = posts_tags_5.post_id) INNER JOIN tags AS tags_4 ON posts_tags_4.tag_id = tags_4.id) INNER JOIN tags AS tags_5 ON posts_tags_5.tag_id = tags_5.id) INNER JOIN posts_tags AS posts_tags_6 ON posts_tags_5.post_id = posts_tags_6.post_id) INNER JOIN posts_tags AS posts_tags_7 ON posts_tags_6.post_id = posts_tags_7.post_id) INNER JOIN posts_tags AS posts_tags_8 ON posts_tags_7.post_id = posts_tags_8.post_id) INNER JOIN tags AS tags_6 ON posts_tags_6.tag_id = tags_6.id) INNER JOIN tags AS tags_7 ON posts_tags_7.tag_id = tags_7.id) INNER JOIN tags AS tags_8 ON posts_tags_8.tag_id = tags_8.id) INNER JOIN posts_tags AS posts_tags_9 ON posts_tags_8.post_id = posts_tags_9.post_id) INNER JOIN tags AS tags_9 ON posts_tags_9.tag_id = tags_9.id WHERE (((tags.tag)="tag1") AND ((tags_1.tag)="tag2") AND ((tags_2.tag)="tag3") AND ((tags_3.tag)="tag4") AND ((tags_4.tag)="tag5") AND ((tags_5.tag)="tag6") AND ((tags_6.tag)="tag7") AND ((tags_7.tag)="tag8") AND ((tags_8.tag)="tag9") AND ((tags_9.tag)="tag10"));

Despite censored's remarkable SQL, we both agreed this method had obvious disadvantages(!) and wanted to think up something better. We agreed that the major disadvantage would be that if there were a large number of tags (or posts, for that matter), the growing number of joins would bring down the house. The database server would, effectively, meltdown. So, the challenge was opened up to the #rubyonrails floor.

Denormalize! Denormalize! Denormalize!

rabble_, ODEO's chief coder, said that they'd had some big problems with doing tags and using SQL on their podcasting system. He said they'd had to denormalize a bit. So the topic of denormalization was discussed, and one proposal was that I store tags in a text field, apply a FULLTEXT index, and then use a regular boolean search over them. This would work perfectly and be seamless, but it seemed like a workaround something that SQL should be able to achieve. It lacked elegance, and, above all, proper normalization. (Yes, I'm a stickler when it comes to 3NF.)

No, Normalize!

The always optimistic censored and I discussed alternate solutions. His favorite idea was to do the following: 1) Select all tags we want to view, and the post IDs for each post associated with that tag. 2) Find all post IDs which are associated with EVERY tag queried. 3) Use a SELECT * FROM posts WHERE id IN (1,2,3,4,..,..,..) style query to get the results. This seemed like a viable method, so I set off on coding it.

After fifteen minutes I started to play with my SQL again. I realized that I could write a query which did an INNER JOIN with the tags table against the post_tags join table (and then the posts table itself). This meant I could get a list of every post which contained ANY of the required tags. That is, a union of results. But how do you get a union (OR) to an intersection (AND)?

How about a simple join and grouping?

After staring at rows of data, wanting to work out which ones were duplicated for each tag, I had a brainwave. I recalled that GROUP only takes place AFTER other joins, AND that GROUP has a HAVING operator which takes place AFTER the GROUP! So I wanted to GROUP my results by unique post ID and THEN only show those with a COUNT matching the number of tags I was searching for! It seemed like a far out method, but I tapped in the SQL as best I could, and holy moly.. it worked! censored was as surprised as I was, but very happy it worked out.

We ended up with this:

@posts = Post.find_by_sql ("SELECT p.* FROM posts_tags pt, posts p, tags t WHERE pt.tag_id = t.id AND (t.name = '" + tags.uniq.join ('\' OR t.name=\'') + "') AND p.id=pt.post_id GROUP BY p.id HAVING COUNT(p.id) = " + tags.uniq.length.to_s)

What this does is do a regular join between the posts, post_tags, and tags table, in order to get every combination of posts and their tags, using a WHERE clause to only choose posts with certain tags. Then we group by the post's ID to get individual posts, BUT then only show those which have the correct amount of tags. Therefore, all the results MUST be common to every tag specified in 'tag'. Eureka! We now have a proper tagging system working, with infinite depth search..




April 02, 2005 | Posted by peter | Comments (7)
Comments

Nice. Its been on my todo list to write up a tagging + sql how to. But I think this about covers it.

The one other query I've been playing with in the context of my tagging app is "related tags", tags which appear N or more times with the tag(s) I'm currently viewing.

Posted by: kellan at April 5, 2005 06:10 PM

Should have looked at snippets before posting. You came up w/ the same related to query I did. Was hoping there was a way to do it without the sub-query, as I'm seeing some performance problems w/ MySQL 4.1.x's subqueries.

Posted by: kellan at April 5, 2005 06:13 PM

Great tip! Here's what I did:

a_item_tag structure: itemID, vcTag

select * from item
where postItemID in
(select itemID from a_item_tag
where itemID in (select itemID from a_item_tag
where vcTag in ('tag1','tag2','tag3') )
group by itemID
having count(itemID)>2
)

Posted by: andreinla at April 8, 2005 11:08 PM

Thanks for the tip! here's how I did it.

a_item_tag: itemID, vcTag


select * from item
where postItemID in
(select itemID from a_item_tag
where itemID in (select itemID from a_item_tag
where vcTag in ('tag1','tag2','tag3') )
group by itemID
having count(itemID)>2
)

Posted by: andreinla at April 8, 2005 11:09 PM

a_item_tag: itemID, vcTag


select * from item
where postItemID in
(select itemID from a_item_tag
where itemID in (select itemID from a_item_tag
where vcTag in ('tag1','tag2','tag3') )
group by itemID
having count(itemID)>2
)

Posted by: andreinla at April 8, 2005 11:10 PM

This seems overly complicated. Here's two sites I created that use tags for posts, it seems to work pretty well: http://brotherjoe.com and http://hellojoseph.com ... It looks like you have a seperate tags table that you link posts to with a linking table I'm assuming. I just store them as a varchar(255) and use regex to find whatever combo of tags is requested. Try clicking on tags on those sites and you'll see related tags, which you can click on to view just that related tag, or click the + sign next to it to "add" it to the current tag filter. And the code is dead simple:

if( !empty( $array['tags'] )) {
$tags = explode( " ", $array['tags'] );
foreach( $tags as $tag ) $extra_tag .= "and e.tags REGEXP '[[::]]' ";
}

and then $extra_tag is sitting there after my where clause in my query. Well, just another take on it.

Posted by: sean at April 10, 2005 03:30 AM

Forgot to say, of course, that the code quoted is for PHP, not ruby, but I'm sure you would have no problem coming up with an equivalent. :) My main point was how I'm storing it in the database and how I'm searching for tag matches.

Posted by: sean at April 10, 2005 03:32 AM

Return to the homepage.
Privacy Policy