(1) the complete title of one (or more) paper(s) published in the open literature describing the work that the author claims describes a human-competitive result

"A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8.00 each." International Conference on Software Engineering (ICSE'12).

"Representations and Operators for Improving Evolutionary Software Repair." Genetic and Evolutionary Computation Conference (GECCO 2012).

(2) the name, complete physical mailing address, e-mail address, and phone number of EACH author of EACH paper (alphabetical order)

Michael Dewey-Vogt
Computer Science Department
University of Virginia
85 Engineer's Way
P.O. Box 400740
Charlottesville, VA 22904-4740
mkd5m@cs.virginia.edu
434.982.2200

Stephanie Forrest
Computer Science Department
MSC01 1130
1 University of New Mexico
Albuquerque, NM 87131
forrest@cs.unm.edu
505-277-7104

Claire Le Goues
Computer Science Department
University of Virginia
85 Engineer's Way
P.O. Box 400740
Charlottesville, VA 22904-4740
legoues@cs.virginia.edu
434.982.2200

Westley Weimer
Computer Science Department
University of Virginia
85 Engineer's Way
P.O. Box 400740
Charlottesville, VA 22904-4740
weimer@cs.virginia.edu
434.982.2200

(3) the name of the corresponding author (i.e., the author to whom notices will be sent concerning the competition)

Stephanie Forrest

(4) the abstract of the paper(s)

ICSE'12: There are more bugs in real-world programs than human programmers can realistically address. This paper evaluates two research questions: "What fraction of bugs can be repaired automatically?" and "How much does it cost to repair a bug automatically?" In previous work, we presented GenProg, which uses genetic programming to repair defects in off-the-shelf C programs. To answer these questions, we: (1) propose novel algorithmic improvements to GenProg that allow it to scale to large programs and find repairs 68% more often, (2) exploit GenProg's inherent parallelism using cloud computing resources to provide grounded, human-competitive cost measurements, and (3) generate a large, indicative benchmark set to use for systematic evaluations. We evaluate GenProg on 105 defects from 8 open-source programs totaling 5.1 million lines of code and involving 10,193 test cases. GenProg automatically repairs 55 of those 105 defects. To our knowledge, this evaluation is the largest available of its kind, and is often two orders of magnitude larger than previous work in terms of code or test suite size or defect count. Public cloud computing prices allow our 105 runs to be reproduced for $403; a successful repair completes in 96 minutes and costs $7.32, on average.

GECCO'12: Evolutionary computation is a promising technique for automating time-consuming and expensive software maintenance tasks, including bug repair. The success of this approach, however, depends at least partially on the choice of representation, fitness function, and operators. Previous work on evolutionary software repair has employed different approaches, but they have not yet been evaluated in depth. This paper investigates representation and operator choices for source-level evolutionary program repair in the GenProg framework [17], focusing on: (1) representation of individual variants, (2) crossover design, (3) mutation operators, and (4) search space definition. We evaluate empirically on a dataset comprising 8 C programs totaling over 5.1 million lines of code and containing 105 reproducible, human-confirmed defects.
Our results provide concrete suggestions for operator and representation design choices for evolutionary program repair. When augmented to incorporate these suggestions, GenProg repairs 5 additional bugs (60 vs. 55 out of 105), with a decrease in repair time of 17–43% for the more difficult repair searches.

(5) a list containing one or more of the eight letters (A, B, C, D, E, F, G, or H) that correspond to the criteria (see above) that the author claims that the work satisfies

D, G

(6) a statement stating why the result satisfies the criteria that the contestant claims (see examples of statements of human-competitiveness as a guide to aid in constructing this part of the submission)

Rationale for D: GenProg repairs bugs less expensively than documented bug bounties. To cope with the problem that there are not enough software developer resources to fix all known software defects, many companies have begun offering "bug bounties" to outside developers, paying for human-developed candidate bug repairs. Well-known companies such as Mozilla (http://www.mozilla.org/security/bug-bounty.html, $3,000/bug) and Google (http://blog.chromium.org/2010/01/encouraging-more-chromium-security.html, $500/bug) offer significant rewards for security fixes, with bounties rising to thousands of dollars in "bidding wars" (http://www.computerworld.com/s/article/9179538/Google_calls_raises_Mozilla_s_bug_bounty_for_Chrome_flaws). Although security bugs command the highest prices, more wide-ranging bounties are available. Consider the more indicative case of Tarsnap.com (http://www.tarsnap.com/bugbounty.html), an online backup provider. Over a four-month period, Tarsnap paid $1,625 for fixes for issues ranging from cosmetic errors (e.g., typos in source code comments), to general software engineering mistakes (e.g., data corruption), to security vulnerabilities. Of the approximately 200 candidate patches submitted to claim various bounties, about 125 addressed spelling mistakes or style concerns, while about 75 addressed more serious issues, classified as "harmless" (63) or "minor" (11). One issue was classified as "major." Developers at Tarsnap confirmed corrections by manually evaluating all submitted patches. If we treat the 75 non-trivial repairs as true positives (38%) and the 125 trivial reports as overhead, Tarsnap paid an average of $21 for each non-trivial repair and received one about every 40 hours. Despite the facts that the bounty pays a small amount even for reports that do not result in a usable patch and that about 84% of all non-trivial submissions fixed "harmless" bugs, the final analysis was: "Worth the money? Every penny." (http://www.daemonology.net/blog/2011-08-26-1265-dollars-of-tarsnap-bugs.html).

Our results use evolutionary computation to automatically repair bugs with a use case similar to that of the outsourced "bug bounty hunters," for about 1/3 of the cost reported for Tarsnap. Our cost figures are actual dollars spent using Amazon's EC2 cloud computing infrastructure for the experiments. Each trial was given a "high-cpu medium (c1.medium) instance" with two cores and 1.7 GB of memory. Simplifying a few details, such virtualized instances can be purchased as spot instances at $0.074 per hour (with up to a one-hour start-time lag) or as on-demand instances at $0.184 per hour. These August–September 2011 prices summarize CPU, storage, and I/O charges.
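For illustration, the "about 1/3 of the cost" figure can be checked with a short back-of-the-envelope calculation in Python that uses only the numbers quoted above (a rough sketch; the ICSE'12 paper reports the exact cloud-billing accounting):

# Back-of-the-envelope comparison using only figures quoted in this entry.
genprog_total_cost = 403.00   # USD to reproduce all 105 GenProg cloud runs (ICSE'12)
genprog_repairs = 55          # defects repaired automatically
genprog_per_repair = genprog_total_cost / genprog_repairs   # ~7.33 USD, close to the $7.32 average above

tarsnap_payout = 1625.00      # USD paid in bounties over four months
tarsnap_repairs = 75          # non-trivial repairs received
tarsnap_per_repair = tarsnap_payout / tarsnap_repairs       # ~21.67 USD, the ~$21 figure above

print(genprog_per_repair / tarsnap_per_repair)              # ~0.34, i.e., about 1/3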
Rationale for G: Debugging is an indisputably difficult activity, as anyone who has taught computer programming to humans knows. In fact, debugging and maintenance can account for up to 90% of the total cost of a typical software project, and the number of outstanding software defects exceeds the resources available to fix them. The cited papers, taken together, report that our improved evolutionary computation method has successfully repaired 60 out of 105 bugs, each of which was deemed serious enough that a human produced a repair and committed it to a source code repository. These include bugs in programs totaling over 5 million lines of C source code with a wide range of functionality (language interpreters, mathematical routines, image manipulation, and web servers). Our results use a new representation (each chromosome is a list of edits, referred to as the "patch representation," rather than a complete abstract syntax tree; see the sketch at the end of this rationale) and rely on other improvements to the GP algorithm. The ICSE'12 paper reports results showing that the patch representation improves the success rate for finding a repair by 68% on the original benchmark programs over those reported in two 2009 papers (ICSE'09 and GECCO'09). Another possible metric is repair quality. Our Humie entry does not include a systematic study of repair quality, but we discuss repair quality anecdotally in the ICSE'12 paper. A human study was recently published in a paper with overlapping authorship (Zachary P. Fry, Bryan Landau, Westley Weimer: A Human Study of Patch Maintainability. International Symposium on Software Testing and Analysis (ISSTA) 2012). In this paper, human-written patches were compared to GP-generated patches, and it was concluded that machine-generated patches combined with machine-generated documentation were judged slightly more maintainable than human-generated patches combined with human-generated documentation.
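For illustration only, the following short Python sketch conveys the flavor of the patch representation and operators described above. The names and details are illustrative assumptions, not GenProg's actual implementation, which manipulates statements of C programs:

# Minimal sketch of the "patch representation": a candidate repair is a
# short list of statement-level edits rather than a full copy of the
# program's abstract syntax tree.  All names here are illustrative.
import random

EDIT_KINDS = ("delete", "insert", "replace")   # statement-level mutation operators

def random_edit(fault_stmts, fix_stmts):
    # Mutate a statement implicated by fault localization, reusing
    # code drawn from elsewhere in the program ("fix" statements).
    return (random.choice(EDIT_KINDS),
            random.choice(fault_stmts),
            random.choice(fix_stmts))

def mutate(patch, fault_stmts, fix_stmts):
    # A mutation appends one new edit to the edit list.
    return patch + [random_edit(fault_stmts, fix_stmts)]

def crossover(p1, p2):
    # One-point crossover over edit lists, in the spirit of the
    # crossover designs studied in the GECCO'12 paper.
    i, j = random.randint(0, len(p1)), random.randint(0, len(p2))
    return p1[:i] + p2[j:], p2[:j] + p1[i:]

def fitness(patch, apply_patch, test_suite):
    # Fitness: number of test cases passed by the patched program
    # (the papers use a weighted count of passing tests).
    program = apply_patch(patch)
    return sum(1 for test in test_suite if test(program))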
(7) a full citation of the paper (that is, author names; publication date; name of journal, conference, technical report, thesis, book, or book chapter; name of editors, if applicable, of the journal or edited book; publisher name; publisher city; page numbers, if applicable)

Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. "A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8.00 each." International Conference on Software Engineering (ICSE'12), Zurich, Switzerland, June 2012.

Claire Le Goues, Westley Weimer, and Stephanie Forrest. "Representations and Operators for Improving Evolutionary Software Repair." Genetic and Evolutionary Computation Conference (GECCO 2012). To appear, July 2012.

(8) a statement either that "any prize money, if any, is to be divided equally among the co-authors" OR a specific percentage breakdown as to how the prize money, if any, is to be divided among the co-authors

The prize money will be divided equally among the co-authors.

(9) a statement stating why the judges should consider the entry as "best" in comparison to other entries that may also be "human-competitive."

a. We directly measure the monetary cost of our method by running experiments in the Amazon cloud-computing environment. The recent emergence of bug bounties provides an objective market valuation for the cost of a successful bug repair. Our automatically generated repairs cost no more than 1/3 of this market valuation, and in some cases much less. Our methodology was carefully designed and highly disciplined, and to our knowledge the study is the largest available of its kind, often two orders of magnitude larger than previous studies in terms of code size, test suite size, or number of bugs.

b. The problem is significant according to almost any measure (see, for example, R. C. Seacord, D. Plakosh, and G. A. Lewis. Modernizing Legacy Systems: Software Technologies, Engineering Process and Business Practices. Addison-Wesley Longman Publishing Co., Inc., 2003). According to the IDC Software Quality Survey (2008), US corporate development organizations spend $5.2–$22 million annually fixing software defects, and some estimates show that software maintenance consumes 0.6% of U.S. GDP.

c. Our entry is based on significant improvements to the original algorithm. The ICSE'12 paper reports a 68% improvement in success rate over our last Humie entry (2 years ago) on the same set of benchmark programs.