This blog is going to be about Natural Language Processing. This is a new topic for me, and I'm going to post the experiments I run and what I'm sure will be many fascinating discoveries ;). I'm applying to formally study this stuff at Brigham Young University, and for the foreseeable future most of what I post here will be the result of work done there. Think of this as a research journal.
The first thing I'm working on is the paper Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality by Lau et al. (http://www.aclweb.org/anthology/E14-1056). This paper builds on research captured in Chang et al.'s Reading Tea Leaves: How Humans Interpret Topic Models (http://www.cs.columbia.edu/~blei/papers/ChangBoyd-GraberWangGerrishBlei2009a.pdf).
At this point in the journey I am the quintessential NLP n00b, but here's my understanding of the papers: Chang et al. is about generating probabilistic topic models and then seeing how humans evaluate those topic models. Are the generated topics any good? Lau et al. is about creating automated processes for evaluating topic models and then seeing how closely the machine-generated evaluations match the human ones.
At this early stage I find this interesting, because so much of my computer science career up to this point has been founded on the idea that there is a "right" answer. That all seems to go out the window as soon as humans get involved. With no absolute standard of truth, we're left comparing algorithms to averaged human samples.
In any case, the experiment I'm performing is based on the work in Lau et al. The code used in that paper can be found at https://github.com/jhlau/topic_interpretability. In examining that code, we came across https://github.com/jhlau/topic_interpretability/blob/master/ComputeWordCount.py#L25, which shows that `window_size` is hard-coded to 20 when the system generates the probability tables used in later calculations.
This piece of code looks for the co-occurrence of words in a given text. That is, how likely is it that word A appears within 20 words of word B? In this code base, we're looking at how often generated topic words appear together within these windows.
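To make the idea concrete, here's a minimal sketch of window-based co-occurrence counting. This is my own illustration, not the code from the repo: the function name `count_cooccurrences` and its signature are hypothetical, but the mechanic (slide a fixed-width window over the tokens and count windows where two topic words land together) is the same one the hard-coded `window_size` of 20 controls.

```python
from collections import Counter

def count_cooccurrences(tokens, topic_words, window_size=20):
    """Slide a fixed-width window over `tokens` and count, for each
    pair of topic words, the number of windows containing both.

    tokens:      list of word tokens from the corpus
    topic_words: set of topic words we care about
    window_size: width of the sliding window (hard-coded to 20 in
                 the original code base)
    """
    word_counts = Counter()   # windows containing each single word
    pair_counts = Counter()   # windows containing both words of a pair
    n_windows = max(1, len(tokens) - window_size + 1)
    for start in range(n_windows):
        # Only the topic words present in this window matter.
        window = set(tokens[start:start + window_size]) & topic_words
        for w in window:
            word_counts[w] += 1
        # Count each unordered pair once per window.
        for a in window:
            for b in window:
                if a < b:
                    pair_counts[(a, b)] += 1
    return word_counts, pair_counts, n_windows
```

From these counts you can derive the probabilities the evaluation measures need, e.g. P(A, B) ≈ pair_counts[(A, B)] / n_windows.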
The experiment I'm running here is to study the effect that window size has on the quality of our evaluations. The process is something like this:
- Use the source data from Lau et al. to generate the same numbers published in the paper, to make sure I'm using the system correctly. There's a fair amount of pre-processing to be done here, as described in the paper.
- Rework the code base to make `window_size` variable
- Iterate through all the data using various `window_size` values and compare the results to the human-generated scores. Scores closer to the human values will be considered "better" than scores that aren't as close.
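The comparison step above can be sketched roughly like this. Again, this is my own hypothetical outline, not code from the repo: `pearson` and `best_window_size` are helper names I made up, and I'm assuming "closer to the human values" is measured by correlation between per-topic machine scores and per-topic human ratings (the kind of comparison the papers use).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_window_size(window_sizes, machine_scores, human_scores):
    """Pick the window size whose per-topic machine scores correlate
    best with the human judgments.

    machine_scores: dict mapping window size -> list of per-topic scores
    human_scores:   list of per-topic human coherence ratings
    """
    return max(window_sizes,
               key=lambda w: pearson(machine_scores[w], human_scores))
```

In other words: run the full pipeline once per candidate `window_size`, collect the per-topic coherence scores, and keep whichever setting tracks the human ratings most closely.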
Prior to this writing, I've already done step 2. That work can be found at https://github.com/juanpaco/topic_interpretability/tree/parameterize. If this experiment proves interesting and/or successful, I'll submit the update back to Dr. Lau as a pull request. (Incidentally, I've been working in industry for the past 8 years, and I think it's great to see tools like GitHub being adopted in academia.)
Today starts my dive into OpenNLP and Morpha. The next few posts will be tutorials on how to use those tools, since I'll be learning myself. Hopefully they'll be of use to someone else.