DM Stat-1 Articles

Explaining Collaborative Filtering:
An Openwork
Bruce Ratner, Ph.D.

Collaborative Filtering (CF) is a method for predicting an individual’s next item selection, or for recommending a collection of items from which to choose, by matching that individual’s profile of past item selections with the profiles of like-minded individuals’ past item selections. For example, when a customer buys my book Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data on Amazon.com, she is offered the recommendation “Customers who bought this item (book) also bought …,” followed by a list of five books. Amazon is among the better-known users of CF, but there are many users across industry sectors, such as music, movies at the theater, video/CD movies, shoes (Zappos.com) and even online dating (JDate.com)! There are numerous commercial and non-commercial CF systems. However, they are all “black boxes.” Some CF systems are called “proprietary,” in a disingenuous effort to appear forthcoming. Regardless, unless the innards of CF systems are in the open, users and potential users will be uneasy about accepting CF. The purpose of this article is to show the openwork of CF using SAS© procedures, thereby revealing the secrets of the mysterious CF to ensure ease of acceptance and eagerness in its use.

OUTLINE

  1. Collect a large database of individuals’ item selections.
  2. Perform a cluster analysis using SAS procedure VARCLUS on the individuals, not the items. This requires first transposing the dataset with SAS procedure TRANSPOSE, so that the individuals become the variables to be clustered.
  3. For each cluster, measure its like-mindedness: Calculate the average correlation among the individuals in the cluster across their “relevant” item selections.
  4. For a new individual, for each cluster: Measure the change in the cluster’s average correlation when the new individual is included in the cluster.
  5. Identify the new individual’s cluster: The new individual’s cluster is the cluster that experiences the greatest percentage increase in the average correlation when the new individual is included in the cluster.
  6. Profile the new individual’s cluster: The profile is the array of means of the relevant item selections among the individuals in the cluster.
  7. Measure the “likeness” of the new individual with his or her cluster: Calculate the correlation coefficient between the new individual’s profile of item selections and the profile of the new individual’s cluster.
  8. Prediction or recommendation for the new individual: If the correlation coefficient (in step 7) is “large,” then an item-by-item comparison between the two profiles will yield the next item or collection of items to recommend.
  9. If the CF fails: If the correlation coefficient is not large, then no prediction or recommendation can be made. In such a case, heuristically examine the relevancy of the items within the clusters using SAS procedure VARCLUS on the items; then repeat step 2.
  10. Optional: Heuristically examine the clusters (in step 2), and then repeat step 2.
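
Steps 3 through 8 can be illustrated with a small sketch. It is written in Python rather than SAS, purely for illustration, and it rests on several assumptions that the outline does not state: the clusters from step 2 are already given (hard-coded here instead of produced by VARCLUS), every item counts as “relevant,” a value of 0 means the item was not selected, Pearson correlation is the correlation measure, and a correlation of 0.5 or more counts as “large.” All data values, names and the cutoff are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def average_correlation(members):
    """Step 3: mean pairwise correlation among the individuals in a cluster."""
    return mean(pearson(x, y) for x, y in combinations(members, 2))

def assign_cluster(clusters, newcomer):
    """Steps 4-5: choose the cluster with the greatest signed percentage change
    in average correlation when the newcomer is added (assumes a nonzero baseline)."""
    def pct_change(members):
        before = average_correlation(members)
        after = average_correlation(members + [newcomer])
        return 100.0 * (after - before) / abs(before)
    return max(clusters, key=lambda name: pct_change(clusters[name]))

def cluster_profile(members):
    """Step 6: item-by-item means across the cluster's individuals."""
    return [mean(column) for column in zip(*members)]

def recommend(profile, newcomer, top_n=1):
    """Step 8: among items the newcomer has not selected (value 0),
    return the indices with the highest cluster means."""
    unselected = [i for i, value in enumerate(newcomer) if value == 0]
    return sorted(unselected, key=lambda i: profile[i], reverse=True)[:top_n]

# Hypothetical toy data: rows are individuals, columns are six items.
clusters = {
    "A": [[5, 4, 5, 1, 0, 1], [4, 5, 4, 0, 1, 0], [5, 5, 4, 1, 1, 0]],
    "B": [[0, 1, 0, 5, 4, 5], [1, 0, 1, 4, 5, 4], [0, 1, 1, 5, 5, 4]],
}
newcomer = [5, 4, 0, 0, 1, 0]
LARGE = 0.5  # assumed cutoff; the outline leaves "large" undefined

name = assign_cluster(clusters, newcomer)          # steps 4-5
profile = cluster_profile(clusters[name])          # step 6
likeness = pearson(newcomer, profile)              # step 7
picks = recommend(profile, newcomer) if likeness >= LARGE else []  # steps 8-9
print(name, round(likeness, 2), picks)
```

With this toy data the newcomer lands in cluster A, and the one item recommended is the third item (index 2), which the cluster favors and the newcomer has not yet selected. Note that adding a newcomer usually lowers a cluster's average correlation, so the sketch treats the “greatest percentage increase” as the largest signed change, which may simply be the smallest decrease.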



For more information about this article, call Bruce Ratner at 516.791.3544,
1 800 DM STAT-1, or e-mail at br@dmstat1.com.