07 December 2005

A really big, possibly the largest, issue in biography processing is the precise and accurate identification of every reference to the protagonist ("Euler", "Leonhard", "he"). It would be interesting to see whether traditional coreference resolution works better, or whether a very good classification/identification scheme would. How would we evaluate/measure the difference in performance?

GATE and ANNIE already come with a traditional, rule-based coreferencer (CO), although the pronominal coreferencer always crashes (hopefully the memory trick works; I'm not in the syslab right now, so I'll have to try this soon), and I think FASTUS also has a rule-based CO. It looks like ML (the stuff provided in GATE, possibly other ideas including bootstrapping like in that paper, or other approaches like maxent) could provide more of a classification or NE-esque view of "protagonist resolution".

Maybe we could make protagonist identification into a specialized subcategory of IE: resumes would be the primary corporate interest, the corpus I have is biographies, and there are other types of documents with a clearly defined protagonist (as lots of literature supposedly should have). A generic "protagonist identification/coreference" could be a nice development, as an area that is a subcategory of both NE and CO, both of which feature prominently in documents.

In addition, in terms of learning sets, protagonist resolution would definitely provide the most bang for the buck (as we know, the availability of training data is a big bottleneck in IE), simply because of the sheer number of times the protagonist is mentioned in any biography. It's just a thought that I'm making up currently, but I've given a lot of thought to how this project can be different from a lot of the IE stuff going on at Sheffield, etc.
Protagonist resolution is certainly something I haven't heard of. Right now systems would rely on NE and CO, but since CO is so unreliable, a protagonist-specific module combining NE and CO could possibly do much better, and if it does (and I'm sure it will, mainly because traditional CO just sucks), it would be a huge boon for IE on some types of documents (resumes, biographies, etc.).

(Above from an email to Mr. Parker today.)

The queries "protagonist information extraction", "protagonist resolution", "biography information extraction", and "resume information extraction" all return "No documents match Boolean query. Trying non-Boolean relevance query". So nobody has done this.

Umm, looking at GATE ML's config file now. The config itself is pretty simple and just uses WEKA magic, but that means I need to install WEKA.
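
On the evaluation question from earlier (coreference vs. classification for protagonist mentions), one simple option would be to hand-annotate the protagonist mentions in a few biographies and score each approach by precision, recall, and F1 over the mention spans. A minimal sketch in Python, assuming exact-match scoring on character offsets; the function name `prf`, the offsets, and the two "system outputs" are all made up for illustration and are not real GATE or WEKA output:

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of (start, end) mention spans.

    A predicted span only counts as correct if it matches a gold span exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: character offsets of protagonist mentions
# ("Euler", "he", "Leonhard", "he") in one hand-annotated biography.
gold = {(0, 5), (20, 22), (40, 48), (60, 62)}

# Hypothetical outputs of the two approaches being compared.
coref_output = {(0, 5), (40, 48)}                            # misses both pronouns
classifier_output = {(0, 5), (20, 22), (40, 48), (70, 72)}   # one spurious mention

for name, spans in [("coref", coref_output), ("classifier", classifier_output)]:
    p, r, f = prf(gold, spans)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Exact span matching is the strictest choice; a looser variant could give credit for overlapping spans, which might matter if the two approaches draw mention boundaries differently.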