CONCURRENT PROGRAMMING FOR EXTRACTIVE SUMMARIES IN PARE

Jesse Gardner,  Danielle Jaegers,  Scott Thede*

DePauw University, Computer Science, Greencastle, IN 46135

sthede@acm.org


Abstract

Introduction: First developed in 2002 by Lucas Kreger, Rachel Tornheim, and Scott Thede, PARE (Pruner and Redundancy Eliminator) is an automatic text-summarization program that uses extraction to summarize articles. The program creates a summary by “paring” away the non-essential article text. The program reads in an article and splits it into sentences. It then uses a link-grammar parser created at Stanford University to parse each sentence individually. PARE uses a modified Page Rank algorithm (like Google) to weight each word node based upon its links to other weighted words. These weights are then used to determine the most important sentences for the summary.

Improvements: This summer we focused on improving the efficiency of the program. Since PARE can be split into a few different individual processes, coding PARE in concurrent threads should increase speed. Since the mapper waits for all the sentences to be parsed, and each sentence can be parsed independently of the others, the best place for PARE to run in parallel is while parsing and mapping the sentences. Now the sentences can be parsed independently and mapped as they are parsed.

We created two new Runnable classes for the parsing and mapping threads designed to work with the parser interface already developed in PARE. To ensure that all sentences will be parsed and mapped only once, we created a synchronized index that allows multiple parser threads to locate the sentences to parse within the list of sentences and ensures that each sentence is parsed only once. These threads then drop the index of their newly parsed sentence into a synchronized queue that the mapping thread uses to place the sentences into the word map.

To test differences in speed we ran the 2007 version of PARE and the threaded PARE program with anywhere from one to five parser threads on the same articles. Hypothesizing that more processors would mean more computing power for more threads, we ran these instances of PARE on three different machines: one Fedora Linux machine with 2 Intel Pentium D processors running at 3.2 GHz, one Mac OS 10.4 machine with 4 PowerPC G5’s running at 2.5 GHz, and one Linux cluster with 6 Dual Xeon processors (12 total processors) running at 3.6GHz.

Results: The serial version of PARE and the threaded version with one parser thread were the fastest on the 2-core Linux machine and comparable in time to each other. On the 4-core Macintosh we found that (contrary to our hypothesis) the one-parser thread version severely out-performed the three-parser thread version.

On the cluster, three parser threads finished in 88 seconds on average and the serial version finished in 126 seconds, a 33% decrease. Adding a forth or fifth thread increased the total run time by only 2-6 seconds. Since it takes 4 seconds to load the grammar file each time a new parser thread is initialized on the cluster, this increase is explainable.

Conclusions: We found that in general, the more processors, the faster the threaded program would run, and the more threads it could handle; however our data also showed that our project still has many imperfections to overcome. The results from the 12-core Linux cluster showed a decrease with multiple parser threads, but the Mac results showed the opposite. Having three parser threads running actually means that the program will split into five threads, taking into account the main and mapping threads. The fact that there was one more thread than there were processors could have accounted for the speed loss.

Future Work: Reading in the grammar file for the parser sequentially each time a parser thread is started causes program efficiency to drop. We recommend reading in the grammar file once and allowing multiple parser threads to read from it. Our research also uncovered a divide-and-conquer method for large Page Rank situations, which could be implemented into PARE in the future. One could also preprocess the sentences as they are being read in from the file, though this change may not increase PARE’s speed. Creating some form of anaphora resolution that works with the current parser would also significantly increase the quality of the summaries PARE outputs.

This work was supported by National Science Foundation Grant Number IIS-0552370.

Download

[Abstract (DOC)]