Extending Code for Other Analysis

The code base is designed to be extensible, so users can easily create new types of analysis of Wikipedia dump files.

Some of these currently implemented include:

Here's how to create your own custom analysis tool:

Start with page_factory.ml. The wiki dumps are analyzed by a class of type page. Page is a virtual class; the various classes that implement it provide the functionality, for instance, to compute the statistics for author reputation, or to compute text trust. Objects of type page are produced by page_factory.

Each page, to do its work, uses objects of one of the sub-classes of revision in order to store revisions; there are many subclasses, as what we need to store for each revision is not always the same.

First, add a constructor for your new analysis (for example, Linear_analysis) to the analysis_t type at the top of page_factory.ml.

Next, update print_mode by adding a case for your new analysis.

Now, add a new method, set_linear_analysis for example, which should look like this:

    method set_linear_analysis () = mode <- Linear_analysis

Moving down page_factory.ml, add in another flag to the get_arg_parser list, which will tell page_factory to do your new analysis when set.

For example, ("-linear_anaysis", Arg.Unit self#set_linear_analysis, "Does linear analysis.");
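The factory-side steps described so far can be sketched in isolation as follows. This is a minimal, self-contained sketch rather than the real page_factory.ml: the class name page_factory_sketch, the placeholder constructor Reputation_analysis, and the choice to have print_mode return a string are all assumptions made for illustration. Only Linear_analysis, set_linear_analysis, get_arg_parser, and the flag come from the text above.

```ocaml
(* Hypothetical stand-in for analysis_t; only Linear_analysis is from the text. *)
type analysis_t =
  | Reputation_analysis  (* placeholder for an already existing analysis *)
  | Linear_analysis      (* the newly added analysis *)

(* Hypothetical, stripped-down factory: just the pieces touched when adding a mode. *)
class page_factory_sketch =
  object (self)
    val mutable mode : analysis_t = Reputation_analysis

    (* Setter invoked by the command-line flag. *)
    method set_linear_analysis () = mode <- Linear_analysis

    (* Assumption: returns a description; the real print_mode may print directly. *)
    method print_mode : string =
      match mode with
      | Reputation_analysis -> "reputation analysis"
      | Linear_analysis -> "linear analysis"

    (* The list handed to the standard Arg module. *)
    method get_arg_parser : (Arg.key * Arg.spec * Arg.doc) list =
      [ ("-linear_anaysis", Arg.Unit self#set_linear_analysis,
         "Does linear analysis.") ]
  end

let () =
  let factory = new page_factory_sketch in
  (* Simulate the command line "evalwiki -linear_anaysis". *)
  let argv = [| "evalwiki"; "-linear_anaysis" |] in
  Arg.parse_argv ~current:(ref 0) argv factory#get_arg_parser
    (fun _ -> ()) "usage: evalwiki [options]";
  print_endline factory#print_mode
```

Passing self#set_linear_analysis to Arg.Unit works because a method applied to no arguments is an ordinary unit -> unit function, so the flag simply flips the mode when it is seen.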

Lastly, create a handler which will instantiate your custom implementation of the Page class:

    Linear_analysis -> new Linear_analysis.page id title out_file
        n_text_judging n_edit_judging !equate_anons
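To show how such a handler fits into a dispatch on the analysis mode, here is a hedged, self-contained sketch. The Page module, the two-argument add_revision, the Noop_analysis placeholder, and the function name make_page are assumptions made to keep the example small; the real virtual page class and the real handler take more arguments.

```ocaml
(* Simplified stand-in for the virtual page class. *)
module Page = struct
  class virtual page =
    object
      method virtual add_revision : int -> string -> unit
      method virtual eval : unit
    end
end

(* Hypothetical module holding the new analysis. *)
module Linear_analysis = struct
  class page (id : int) (title : string) (out_file : out_channel) =
    object
      inherit Page.page
      val mutable n_revs = 0
      (* Count revisions as they arrive; a real analysis would store them. *)
      method add_revision (_rev_id : int) (_text : string) : unit =
        n_revs <- n_revs + 1
      (* Called once all revisions of the page have been added. *)
      method eval : unit =
        Printf.fprintf out_file "%d %s: %d revisions\n" id title n_revs
    end
end

(* Placeholder for an already existing analysis. *)
module Noop_analysis = struct
  class page =
    object
      inherit Page.page
      method add_revision _ _ = ()
      method eval = ()
    end
end

(* A constructor and a module may share a name; they live in separate namespaces. *)
type analysis_t = Noop_analysis | Linear_analysis

(* One match case per analysis; each case builds the matching page object. *)
let make_page (mode : analysis_t) (id : int) (title : string)
    (out_file : out_channel) : Page.page =
  match mode with
  | Noop_analysis -> (new Noop_analysis.page :> Page.page)
  | Linear_analysis -> (new Linear_analysis.page id title out_file :> Page.page)

let () =
  let p = make_page Linear_analysis 1 "Example" stdout in
  p#add_revision 10 "some text";
  p#eval
```

Because every case returns the common page type, the rest of the program can call add_revision and eval without knowing which analysis was selected.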


That's it for page_factory.ml. Now, let's look at what we need to do with our page implementation.

First, the class must accept as parameters whatever arguments the handler in page_factory.ml passes to it.

For example:

    class page
      (id: int)
      (title: string)
      (out_file: out_channel)
      (page_seq_n: int)
      =
      object (self)

Now, along with any custom methods you may need, you need to implement the following methods:

    method add_revision
      (id: int) (* revision id *)
      (page_id: int) (* page id *)
      (timestamp: string) (* timestamp string *)
      (time: float) (* time, as a floating point *)
      (contributor: string) (* name of the contributor *)
      (user_id: int) (* user id *)
      (ip_addr: string)
      (username: string) (* name of the user *)
      (is_minor: bool)
      (comment: string)
      (text_init: string Vec.t) (* Text of the revision, still to be split into words *)
      : unit =
      (* your code here *)

    method eval : unit =
      (* your code here *)

Lastly, you will need to figure out what type of revision to use. You can either create your own custom revision class, or else use one of the pre-defined ones.

Either way, your page class needs a mutable Vec.t field to hold the revisions, like so:

    val mutable revs : Revision.write_only_revision Vec.t = Vec.empty

Also, in add_revision, you will need to create a new revision of your chosen type, like so:

    let r = new Revision.write_only_revision id page_id timestamp time contributor
        user_id ip_addr username is_minor comment text_init in
    revs <- Vec.append r revs;
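Putting the page-side pieces together, a complete page class might look roughly like the sketch below. It is written to compile on its own, so it makes several loudly labeled substitutions: the Revision module here is a stand-in rather than the real revision.ml, a plain list replaces Vec.t (and a string list replaces string Vec.t), the describe method is invented for the demonstration, and the revision id parameter is named rev_id to avoid shadowing the page id.

```ocaml
(* Stand-in for Revision.write_only_revision: same argument order as in the
   text, but text_init is a string list here instead of a string Vec.t. *)
module Revision = struct
  class write_only_revision
      (id : int) (page_id : int) (timestamp : string) (time : float)
      (contributor : string) (user_id : int) (ip_addr : string)
      (username : string) (is_minor : bool) (comment : string)
      (text_init : string list) =
    object
      method get_id = id
      (* Hypothetical helper, used only to print something in eval. *)
      method describe : string =
        Printf.sprintf
          "rev %d of page %d at %s (%.0f) by %s/%s (uid %d, ip %s) minor=%b comment=%S chunks=%d"
          id page_id timestamp time contributor username user_id ip_addr
          is_minor comment (List.length text_init)
    end
end

class page (id : int) (title : string) (out_file : out_channel)
    (page_seq_n : int) =
  object
    (* Newest revision first; the text above uses a Vec.t with Vec.append. *)
    val mutable revs : Revision.write_only_revision list = []

    (* Called once per revision while the dump is being read. *)
    method add_revision (rev_id : int) (page_id : int) (timestamp : string)
        (time : float) (contributor : string) (user_id : int)
        (ip_addr : string) (username : string) (is_minor : bool)
        (comment : string) (text_init : string list) : unit =
      let r =
        new Revision.write_only_revision rev_id page_id timestamp time
          contributor user_id ip_addr username is_minor comment text_init
      in
      revs <- r :: revs

    (* Called after the last revision; here it only writes a summary. *)
    method eval : unit =
      Printf.fprintf out_file "Page %d (%s), #%d in the dump: %d revisions\n"
        id title page_seq_n (List.length revs);
      List.iter
        (fun r -> Printf.fprintf out_file "  %s\n" r#describe)
        (List.rev revs)
  end

let () =
  let p = new page 3 "Sandbox" stdout 0 in
  p#add_revision 100 3 "2008-06-05T20:29:28Z" 1212697768.0 "Alice" 42 ""
    "Alice" false "first draft" [ "Hello world" ];
  p#eval
```

In a real analysis, eval is where the stored revisions would be compared or scored; the summary printed here only demonstrates that revisions arrive in order and are available once the page is complete.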

Currently there are these revision types to choose from (all defined in the revision.ml file):

They all have different uses, so be sure to look closely to see which one meets your requirements. If none do, feel free to create a new one.

And that's it! The code will add revisions to your page class, and when done adding revisions, will call eval. Whatever else happens is up to you.

Call your new analysis like so:

./evalwiki -linear_anaysis

Good luck.

Extend the Code Base (last edited 2008-06-05 20:29:28 by IanPye)