Time for another research blog post! This is how I do open notebook science, right here. :)
(side note about Open Notebook Science: Okay, okay, this blog entry isn't the truest sense of Open Notebook Science, seeing as I'm not uploading my discarded datasets or source code. But I don't think the value anyone could get from my discarded stuff would be higher than what it would cost me to go to the trouble of cleaning it up and posting it. I think the most valuable things I can share are the stories of my experiences and the take-away lessons learned, hence this entry. That said, I do occasionally share milestone contributions such as this java applet from last year.)
Anyone who has followed my blog over the years will know that my research has been a quest to create a "thing", or a "guide", that can use all available resources on the WWW to create learning experiences, like little adventures you can follow for the purposes of learning a lesson or gaining practice or experience in a given area. It would be like a self-guided study, only with a little program to prepare fun experiences for you and push you further than you might have otherwise gone on your own. I'm not trying to create lessons or learning adventures - that's the work of an instructional designer - but I am trying to create the thing that creates the learning adventures. So even though I work in a different field, I always need to listen carefully to what educators and instructional designers say, otherwise my work would be terribly uninformed.
Anyway, my approach to this throughout my M.Sc. experience has been to use my supervisor's Ecological Approach. This means I'm treating the WWW as a bank of learning objects, and I can assume that each learner has an agent and that there is a whole bunch of metadata available to me in some structure (like an API). This EA metadata is like molding clay, or building blocks: it's what gets taken in and produced by the "things" or "guides" I'm designing. I first wrote about this in 2006 - at the time I was thinking of using RDF. Since then, I have discovered simulation, so I can use any format I want and I don't have to worry about the actual implementation right now. (But I will have to, eventually!)
Last year, I developed an approach that leveraged Apache Mahout's recommender libraries. Most recommender systems are used to recommend things like books or movies or some kind of product. I twisted the system around so that the item to be recommended wasn't a book or movie, but rather a sequence of learning objects. That way, I could use already existing algorithms for a new purpose, that is, use collaborative filtering on sequences of things.
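To make that twist concrete, here is a minimal sketch (the class and method names are mine, purely illustrative, not my actual code) of how an unordered bundle of learning objects can be collapsed into a single numeric item ID, which is all a recommender data model ever sees:

    import java.util.*;

    // Sketch: collapse a set of learning-object IDs into one synthetic "item ID"
    // that a recommender data model can treat as a single recommendable item.
    public class SequenceItemMapper {
        private final Map<String, Long> keyToItemId = new HashMap<>();
        private long nextItemId = 0;

        // Order-insensitive: sort the learning object IDs so {3,17,5,22} and
        // {17,3,22,5} collapse to the same key (nCr-style counting).
        public long itemIdFor(Collection<Long> learningObjectIds) {
            List<Long> sorted = new ArrayList<>(learningObjectIds);
            Collections.sort(sorted);
            String key = sorted.toString();          // e.g. "[3, 5, 17, 22]"
            return keyToItemId.computeIfAbsent(key, k -> nextItemId++);
        }
    }

Each synthetic item ID then shows up in the preference data exactly the way a book or movie ID would.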
The biggest problem, predictably, is that when you create sequences of things, suddenly the number of "items" explodes. If you have just 40 learning objects, and even if you don't care about the ordering of the items in the sequence, that's still over 90 000 "items" if you take 4 at a time. (nCr, n=40, r=4. It's even bigger if you are true to the definition of "sequence" and you use nPr). Having a large number of items can be problematic because it becomes time consuming for the recommender algorithm to churn through all the calculations.
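For the record, here is the arithmetic behind those numbers (nothing specific to my setup, just plain nCr and nPr):

    // The item-count explosion in numbers: combinations vs. permutations of
    // 40 learning objects taken 4 at a time.
    public class ItemCount {
        static long nPr(int n, int r) {              // ordered sequences
            long result = 1;
            for (int i = 0; i < r; i++) result *= (n - i);
            return result;
        }

        static long nCr(int n, int r) {              // unordered combinations
            return nPr(n, r) / factorial(r);
        }

        static long factorial(int r) {
            long f = 1;
            for (int i = 2; i <= r; i++) f *= i;
            return f;
        }

        public static void main(String[] args) {
            System.out.println(nCr(40, 4));          // 91390
            System.out.println(nPr(40, 4));          // 2193360
        }
    }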
So that is why I am writing this blog entry -- I'm in the middle of trying numerous approaches to address this. I have so many approaches on the go that I've decided to write them down.
For instance, I am experimenting with ways to fine-tune the Mahout settings for optimal performance. I have to make sure I'm using Mahout to the best of its ability so I can get a taste of what the current limits are for non-sequential item recommendations. (taste, hehehe. little private joke.)
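For context, the kind of setup I'm tuning looks roughly like the textbook Taste example below. The file name, similarity measure and neighbourhood size are placeholders, not my actual configuration:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteSketch {
        public static void main(String[] args) throws Exception {
            // Flat file of "userID,itemID,rating" lines; here each itemID is one
            // of my synthetic sequence-items, but Mahout neither knows nor cares.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 5 sequence-items for user 42.
            List<RecommendedItem> recs = recommender.recommend(42, 5);
            for (RecommendedItem rec : recs) {
                System.out.println(rec.getItemID() + " : " + rec.getValue());
            }
        }
    }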
On top of this, I am also experimenting with various approaches to collapse the 90 000 "items" (or numbers astronomically higher than 90 000!). For example, I can chop up the plane (discretize the dimension? what language do people use here?) by varying the definition of "item"; basically I create clusters. For example, I might allow two sequences of length 4 to be considered the SAME item if they have enough learning objects in common. I did this by creating a new "threshold" parameter in my simulation, where the threshold must be <= k, the sequence length.
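In code terms, the threshold test is roughly the sketch below (again, illustrative names rather than my simulation code): two sequences collapse into the same item if their overlap reaches the threshold.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the "threshold" idea: two length-k sequences count as the same
    // item if they share at least `threshold` learning objects.
    public class SequenceEquivalence {

        static boolean sameItem(Set<Long> seqA, Set<Long> seqB, int threshold) {
            Set<Long> overlap = new HashSet<>(seqA);
            overlap.retainAll(seqB);                 // intersection
            return overlap.size() >= threshold;
        }

        public static void main(String[] args) {
            Set<Long> a = new HashSet<>(Arrays.asList(3L, 5L, 17L, 22L));
            Set<Long> b = new HashSet<>(Arrays.asList(3L, 5L, 17L, 40L));
            // With threshold = 3, these two length-4 sequences collapse together.
            System.out.println(sameItem(a, b, 3));   // true
        }
    }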
But creating equivalence classes like this gets tricky, because a given sequence might fall into several overlapping clusters. There are a lot of delightful machine learning / clustering approaches I could/should try here, if only I had some spare time!
So far, my strategies to tackle the exploding items problem have been:
1) tuning the engine for performance (e.g. switching to flat files on disk instead of a database)
2) changing the definition of "item" by creating the threshold parameter.
A third approach I have tried is 3) pre-loading the simulation with a strategically generated synthetic dataset of starter ratings. Theoretically, a recommender algorithm should be able to run on a huge dataset if it has enough starter data to create appropriate comparisons across users or items. This has been fun and is helping my understanding of the inner workings of the algorithms, because I need to understand the algorithm in order to create a helpful dataset to "kick start" the engine.
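As a flavour of what that "kick start" can look like, here is a toy generator for a starter ratings file in the userID,itemID,rating format a file-based data model reads. The counts and the rating distribution are placeholders, since the whole point is to tailor them to the algorithm's needs:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Random;

    // Toy generator for a synthetic "starter" preference file. The numbers of
    // users/items and the rating pattern are placeholders; the real trick is
    // shaping the overlap between users so the similarity calculations have
    // something to grab onto.
    public class StarterRatings {
        public static void main(String[] args) throws Exception {
            int numUsers = 200;
            int numItems = 1000;       // synthetic sequence-items
            int ratingsPerUser = 30;
            Random random = new Random(1234);

            try (PrintWriter out = new PrintWriter(new FileWriter("starter-ratings.csv"))) {
                for (int user = 1; user <= numUsers; user++) {
                    for (int j = 0; j < ratingsPerUser; j++) {
                        int item = 1 + random.nextInt(numItems);
                        // Users and items with matching parity get high ratings,
                        // mismatched parity get low ones, so two blocks of
                        // similar users emerge and user-user similarity is non-trivial.
                        double rating = (user % 2 == item % 2) ? 4.0 + random.nextDouble()
                                                               : 1.0 + random.nextDouble();
                        out.printf("%d,%d,%.1f%n", user, item, rating);
                    }
                }
            }
        }
    }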
Recently, my advisor suggested a fourth approach, which involves 4) cutting Mahout out of the picture and inserting our own algorithm. I know this will have to be done eventually, but it feels too early to do it just yet. I want to finish exploring 1), 2) and 3) before starting 4)! But the approach my advisor suggested is actually quite brilliant, and could be far more efficient than Mahout's recommender libraries, because they weren't designed to do what I am trying to do.
So that is why I wrote this blog entry. I wanted to get all of this out of my head and commit it to "paper" so I can keep coming back to it as I push forward on 4) but inevitably get sidetracked by various pieces connected to 1), 2) and 3).
Wee!
Steph