Human Competitive Results Awards - HUMIES 2023

1. the complete title of one (or more) paper(s) published in the open literature describing the work that the author claims describes a human-competitive result;

Evolving Ranking-Based Failure Proximities for Better Clustering in Fault Isolation

-------------------------------------------------------------------------------

2. the name, complete physical mailing address, e-mail address, and phone number of EACH author of EACH paper(s);

Yi Song
School of Computer Science, Wuhan University
299 Bayi Road, Wuchang District, Wuhan, Hubei 430072, China
e-mail: yisong@whu.edu.cn
tel: +86-18871765620

Xiaoyuan Xie
School of Computer Science, Wuhan University
299 Bayi Road, Wuchang District, Wuhan, Hubei 430072, China
e-mail: xxie@whu.edu.cn
tel: +86-27-6877-6027

Xihao Zhang
School of Computer Science, Wuhan University
299 Bayi Road, Wuchang District, Wuhan, Hubei 430072, China
e-mail: zhangxihao@whu.edu.cn
tel: +86-18809710176

Quanming Liu
School of Computer Science, Wuhan University
299 Bayi Road, Wuchang District, Wuhan, Hubei 430072, China
e-mail: liuquanming@whu.edu.cn
tel: +86-18833810869

Ruizhi Gao
Sonos Inc.
614 Chapala Street, Santa Barbara, CA 93101, USA
e-mail: youtianzui.nju@gmail.com
tel: +1-2145161093

-------------------------------------------------------------------------------

3. the name of the corresponding author (i.e., the author to whom notices will be sent concerning the competition);

Xiaoyuan Xie

-------------------------------------------------------------------------------

4. the abstract of the paper(s);

Failures that are not related to a specific fault can reduce the effectiveness of fault localization in multi-fault scenarios. To tackle this challenge, researchers and practitioners typically cluster failures (e.g., failed test cases) into several disjoint groups, with those caused by the same fault grouped together. In such a fault isolation process that requires input in a mathematical form, ranking-based failure proximity (R-proximity) is widely used to model failed test cases. In R-proximity, each failed test case is represented as a suspiciousness ranking list of program statements through a fingerprinting function (i.e., a risk evaluation formula, REF). Although many off-the-shelf REFs have been integrated into R-proximity, they were designed for single-fault localization originally. To the best of our knowledge, no REF has been developed to serve as a fingerprinting function of R-proximity in multi-fault scenarios. For better clustering failures in fault isolation, in this paper, we present a genetic programming-based framework along with a sophisticated fitness function, for evolving REFs with the goal of more properly representing failures in multi-fault scenarios. By using a small set of programs for training, we get a collection of REFs that can obtain good results applicable in a larger and more general scale of scenarios. The best one of them outperforms the state-of-the-art by 50.72% and 47.41% in faults number estimation and clustering effectiveness, respectively. Our framework is highly configurable for further use, and the evolved formulas can be directly applied in future failure representation tasks without any retraining.
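To make the R-proximity representation described above concrete, the following minimal Python sketch shows how a fingerprinting REF turns one failed test case into a suspiciousness ranking list. It uses Ochiai, a classical SBFL formula, purely as an illustrative placeholder, and the spectrum counts are invented; the evolved REFs themselves are given in the paper and are not reproduced here.

    import math

    def ochiai(ef, ep, nf, np):
        # Placeholder REF (fingerprinting function). ef/ep (nf/np) are the
        # numbers of failed/passed tests that do (do not) execute a statement.
        denom = math.sqrt((ef + nf) * (ef + ep))
        return ef / denom if denom else 0.0

    def ranking_list(spectra, ref=ochiai):
        # Represent one failed test case as a suspiciousness ranking list:
        # score every statement with the REF, then sort, highest risk first.
        scores = {stmt: ref(*counts) for stmt, counts in spectra.items()}
        return sorted(scores, key=scores.get, reverse=True)

    # Toy usage: three statements with made-up (ef, ep, nf, np) counts
    # computed with respect to a single failed test case.
    spectra = {"s1": (3, 1, 0, 6), "s2": (1, 4, 2, 3), "s3": (2, 2, 1, 5)}
    print(ranking_list(spectra))  # ['s1', 's3', 's2'] -- this failure's fingerprint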
-------------------------------------------------------------------------------

5. a list containing one or more of the eight letters (A, B, C, D, E, F, G, or H) that correspond to the criteria (see above) that the author claims that the work satisfies;

(A) The result was patented as an invention in the past, is an improvement over a patented invention, or would qualify today as a patentable new invention.
(B) The result is equal to or better than a result that was accepted as a new scientific result at the time when it was published in a peer-reviewed scientific journal.
(D) The result is publishable in its own right as a new scientific result independent of the fact that the result was mechanically created.
(E) The result is equal to or better than the most recent human-created solution to a long-standing problem for which there has been a succession of increasingly better human-created solutions.
(F) The result is equal to or better than a result that was considered an achievement in its field at the time it was first discovered.
(G) The result solves a problem of indisputable difficulty in its field.

-------------------------------------------------------------------------------

6. a statement stating why the result satisfies the criteria that the contestant claims (see examples of statements of human-competitiveness as a guide to aid in constructing this part of the submission);

(A) We have applied for an invention patent in China for the genetic programming (GP)-based technique proposed in our entry (Patent ID: CN202211345882.6). This application has passed the preliminary examination and is currently under further review [1].

(B) The primary contribution of this entry is to employ GP to automatically generate better solutions for failure indexing, an important task in software testing and debugging. The state-of-the-art solution to this task relies on a formula that computes a suspiciousness ranking list for each failure, so that every failure can first be represented as a ranking list and then indexed to its root cause. The formula is therefore one of the core components. Our method automatically evolves formulas that deliver satisfactory indexing results (a sketch of several formulas cited in this entry is given after this statement). We compare our evolved solutions with previously human-designed and GP-evolved formulas that have been published in peer-reviewed scientific venues [2,3]. The evolved solutions prove to be very promising: all ten of them outperform the previously human-designed and GP-evolved ones, and the best evolved solution achieves improvements of 50.72% and 47.41% over the best published solution to date, in terms of the two evaluation metrics, respectively. Notice that the selected baselines in our entry, for example, Wong2 [4], Crosstab [5], DStar [6], and GP19 [3], were all proposed previously in peer-reviewed journals or conferences. Each of them is a classical and highly effective formula, and their effectiveness has been theoretically proven or empirically demonstrated [2,7,8]. Our GP-evolved solutions surpass them by a remarkable margin. More importantly, the evolution process is performed on only a small part of the benchmark, while the test is carried out on a larger portion of it, demonstrating that GP has the potential of "evolving with a lower cost while applied well in a broader environment". Moreover, even though our formulas are evolved on manually-seeded mutants, they remain equally promising when applied to real-world faults. This demonstrates that GP has the potential of "evolving in a greenhouse while applied well in practice".
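For reference, the sketch below spells out several of the formulas cited in this entry in the usual spectrum notation (ef, ep, nf, np), written from the cited literature (please consult the original papers for the authoritative definitions). Crosstab [5] is omitted because it is considerably more involved, and the ten evolved formulas are available in the paper and the repository [9] rather than reproduced here.

    def wong2(ef, ep, nf, np):
        # Wong2 [4]: coverage by failed tests minus coverage by passed tests.
        return ef - ep

    def tarantula(ef, ep, nf, np):
        # Tarantula [10,11]: normalized failed coverage over total normalized coverage.
        failed = ef / (ef + nf) if (ef + nf) else 0.0
        passed = ep / (ep + np) if (ep + np) else 0.0
        return failed / (failed + passed) if (failed + passed) else 0.0

    def dstar(ef, ep, nf, np, star=2):
        # DStar [6], with the commonly used exponent * = 2.
        denom = ep + nf
        return (ef ** star) / denom if denom else float("inf")

    def gp19(ef, ep, nf, np):
        # GP19 [3], a formula previously evolved by genetic programming.
        return ef * abs(ep - ef + nf - np) ** 0.5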
(D) The evolution approach used for generating formulas in our entry, as well as the ten evolved formulas, has been published at the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22). Furthermore, we have carefully organized the source code of our evolution approach and the evolved formulas into a public repository [9] for potential reuse. This repository passed the peer review of the Artifact Evaluation track of ASE '22 and was awarded the Available and Reusable badges. Notice that once the GP evolution process is completed, the produced formulas are independent of the GP algorithm. That is to say, our GP-evolved formulas are themselves an independent contribution to the field of software debugging, and stakeholders can directly adopt them in future failure indexing tasks without any retraining.

(E) In the past, a collection of formulas was designed for spectrum-based fault localization (SBFL), the goal of which is to pinpoint the faulty entity in an abnormal program. According to the principle of failure indexing, all SBFL formulas can be repurposed for representing failures in failure indexing. Indeed, according to our investigation, the failure indexing community does use these SBFL formulas to represent failures. For example, the widely-used formulas include Tarantula, proposed by Jones et al. in 2001 [10,11], and Crosstab, proposed by Wong et al. in 2011 [5]. Moreover, since the earliest SBFL formulas were manually created, a large number of formulas have been produced either manually [2,8] or using GP [3,7]. Generally speaking, there are 35 commonly-used formulas to date; 32 of them were designed by humans for a variety of purposes and the remaining three were evolved by GP. All of them can be adapted for failure representation tasks. It is worth mentioning that the ten evolved formulas in our entry are all superior to these 35 baselines, and the extent of improvement is significant. Specifically, the best of our ten evolved formulas (i.e., EFF10-83) outperforms the best group of the 35 baselines (i.e., Group4, which includes Wong2 [4], Hamann [8], Simple Matching [8], Sokal [8], Rogers & Tanimoto [8], Hamming etc. [8], and Euclid [8]). The evaluation metrics we use to measure the capability of a formula in failure indexing are FVe and SUMfs (a toy illustration of both follows this statement). For FVe (i.e., the number of faulty programs whose number of faults is correctly predicted based on a specific formula), EFF10-83 exceeds Group4 by 50.72%. For SUMfs (i.e., the clustering effectiveness on those faulty programs whose number of faults is correctly predicted), EFF10-83 exceeds Group4 by 47.41%.
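As promised above, here is a toy computation of the two metrics. The sketch assumes that FVe counts the faulty programs whose number of faults is estimated correctly, and that SUMfs accumulates a per-program clustering-quality score over exactly those programs; the per-program scores and numbers below are invented, and the authoritative metric definitions are those given in the paper.

    def fve_and_sumfs(results):
        # `results` maps each faulty program to a tuple
        # (estimated_faults, true_faults, clustering_score), where the
        # clustering score per program is assumed to be computed elsewhere.
        fve, sumfs = 0, 0.0
        for est_k, true_k, score in results.values():
            if est_k == true_k:      # fault number correctly predicted
                fve += 1
                sumfs += score       # only these programs contribute to SUMfs
        return fve, sumfs

    toy = {
        "prog_a": (2, 2, 0.81),
        "prog_b": (3, 2, 0.66),      # mispredicted fault number: excluded
        "prog_c": (4, 4, 0.74),
    }
    print(fve_and_sumfs(toy))        # FVe = 2, SUMfs = 0.81 + 0.74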
(F) In our experiment, we select 35 existing formulas, designed manually or evolved by GP in prior work, as baselines. These 35 formulas were proposed one after another over the last few decades and have been widely used by the software testing and debugging community. When each of them was first published, it was considered an achievement in its field. For example, the authors of DStar stated in the publication that "DStar is more effective at locating faults than all the other techniques it is compared to" [6]. For another example, the failure indexing technique [12] using the formula Crosstab [5] is generally recognized as the state-of-the-art [13]. We employ GP to evolve formulas that solely serve failure representation, and compare them with these 35 prevalent off-the-shelf formulas in failure indexing tasks, observing a remarkable improvement.

(G) Failure indexing, that is, mining the relationship between failures and their underlying root causes, is a recognized long-standing challenge in the field of software testing and debugging [12-14]. The crux of failure indexing lies in how to properly represent failures [15,16]. In the past, researchers tended to employ the code coverage-based strategy to tackle this problem [17-19], but this tactic has been shown to fail to deliver satisfactory results [12,13,15]. Later, the statistical debugging-based strategy was proposed, which utilizes a formula to produce a ranking list for a specific failure and uses that ranking list to represent the failure [15,20] (a sketch of how such representations can be clustered is given after this statement). Although the statistical debugging-based strategy has been demonstrated to be the state-of-the-art [12], almost all studies using it naively employ off-the-shelf SBFL formulas, which were originally presented for fault localization. Previously, through a comprehensive empirical and preliminary theoretical analysis, we showed that formulas that perform well in fault localization do not necessarily perform well in failure indexing [21]; thus, creating formulas dedicated to failure representation is of high importance. To the best of our knowledge, there is no prior work focusing on this topic. Therefore, it is necessary to consider how to design a well-behaved formula for failure representation tasks. To solve this problem, we utilize genetic programming, a heuristic strategy, to automatically generate failure representation formulas. Our results demonstrate that the evolved solutions effectively alleviate the mentioned concern, with improvements of 50.72% and 47.41% over the existing best solution, regarding the two evaluation metrics, respectively.
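To illustrate how such ranking-list representations can drive fault isolation, the sketch below measures the dissimilarity of two failures with a Kendall-tau-style distance over their ranking lists and groups failures by a simple average-linkage agglomeration. Both the distance and the clustering procedure are illustrative assumptions for this sketch, not necessarily the configuration evaluated in the paper.

    from itertools import combinations

    def kendall_distance(rank_a, rank_b):
        # Fraction of statement pairs ordered differently by the two
        # suspiciousness ranking lists (both lists rank the same statements).
        pos_a = {s: i for i, s in enumerate(rank_a)}
        pos_b = {s: i for i, s in enumerate(rank_b)}
        discordant = sum(
            1 for s, t in combinations(rank_a, 2)
            if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) < 0
        )
        pairs = len(rank_a) * (len(rank_a) - 1) / 2
        return discordant / pairs if pairs else 0.0

    def isolate(failures, k):
        # Average-linkage agglomeration of failures (each a ranking list)
        # into k groups, ideally one group per underlying fault.
        clusters = [[f] for f in failures]
        while len(clusters) > k:
            i, j = min(
                ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda ij: sum(
                    kendall_distance(a, b)
                    for a in clusters[ij[0]] for b in clusters[ij[1]]
                ) / (len(clusters[ij[0]]) * len(clusters[ij[1]])),
            )
            clusters[i].extend(clusters.pop(j))
        return clusters

    # e.g. isolate([rank_f1, rank_f2, rank_f3], k=2) -> two groups of failures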
-------------------------------------------------------------------------------

7. a full citation of the paper (that is, author names; title, publication date; name of journal, conference, or book in which article appeared; name of editors, if applicable, of the journal or edited book; publisher name; publisher city; page numbers, if applicable);

Yi Song, Xiaoyuan Xie, Xihao Zhang, Quanming Liu, and Ruizhi Gao. 2023. Evolving Ranking-Based Failure Proximities for Better Clustering in Fault Isolation. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22). Association for Computing Machinery, New York, NY, USA, Article 41, 1-13. https://doi.org/10.1145/3551349.3556922

-------------------------------------------------------------------------------

8. a statement either that "any prize money, if any, is to be divided equally among the co-authors" OR a specific percentage breakdown as to how the prize money, if any, is to be divided among the co-authors;

Any prize money, if any, is to be divided equally among the co-authors of the cited paper.

-------------------------------------------------------------------------------

9. a statement stating why the authors expect that their entry would be the "best"

Multi-fault localization is a widely-recognized challenge in the field of software debugging, and failure indexing is a reasonable way to alleviate it. In failure indexing, how to properly represent failures in a mathematical and structural form is of great importance. The most advanced strategy for this problem to date is the statistical debugging-based strategy, in which a failure is converted into a suspiciousness ranking list based on a formula. Although numerous formulas have been presented over the last decades, they were all designed from the perspective of fault localization. The difficulty of creating formulas for failure representation lies in the obscurity of its goal: it is hard to precisely determine whether a formula works well in failure representation. Considering that the heuristic nature of genetic programming could help alleviate such a challenge, this entry for HUMIES uses GP to automatically produce formulas for failure representation, achieving promising results (a minimal sketch of such a GP loop is given after the claims below). This entry not only outperforms previously published peer-reviewed human scientific achievements and GP achievements in the field, but also goes much further: it 1) runs GP on a small set of samples while testing the evolved solutions in a more general environment, and 2) runs GP in a simulated environment while testing the evolved solutions in practice. In conclusion, our claims are as follows:
- We provide empirically human-competitive results in automated debugging, which even surpass prior GP results.
- We perform the GP evolution process on a small portion of the benchmark, and the test is carried out on a larger portion of the benchmark. The promising results demonstrate that GP has the potential of "evolving with a lower cost while applied well in a broader environment".
- We perform the GP evolution process on manually-seeded mutants, and the test is carried out in real-world environments. The results demonstrate that GP has the potential of "evolving in a greenhouse while applied well in practice".
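As referenced above, here is a minimal sketch of the kind of tree-based GP loop used to produce such formulas (and of the GP answer in item 10 below): expression trees over the spectrum terminals ef, ep, nf, and np are grown, scored, and recombined by subtree crossover. The operator set, the parameters, and in particular the fitness function are simplified placeholders; the paper's actual fitness function instead scores how well the ranking lists induced by a candidate formula support fault-number estimation and clustering on the training programs.

    import random
    import operator

    # Terminals are the four spectrum counts; the operator set below is a
    # simplified placeholder (the paper's grammar is richer).
    TERMINALS = ("ef", "ep", "nf", "np")
    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
           "/": lambda a, b: a / b if b else 0.0}

    def random_tree(depth=3):
        # Grow a random expression tree: either a terminal or [op, left, right].
        if depth == 0 or random.random() < 0.3:
            return random.choice(TERMINALS)
        return [random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1)]

    def evaluate(tree, env):
        # Apply a candidate formula to one statement's spectrum counts.
        if isinstance(tree, str):
            return float(env[tree])
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))

    def nodes(tree, path=()):
        # Enumerate (path, subtree) pairs, used for subtree crossover.
        yield path, tree
        if isinstance(tree, list):
            yield from nodes(tree[1], path + (1,))
            yield from nodes(tree[2], path + (2,))

    def replace(tree, path, subtree):
        if not path:
            return subtree
        new = list(tree)
        new[path[0]] = replace(tree[path[0]], path[1:], subtree)
        return new

    def crossover(a, b):
        # Standard subtree crossover: graft a random subtree of b into a.
        path, _ = random.choice(list(nodes(a)))
        _, donor = random.choice(list(nodes(b)))
        return replace(a, path, donor)

    def fitness(tree, training_set):
        # Toy placeholder fitness (the real one scores failure indexing quality):
        # here, how closely the formula reproduces target scores on toy spectra.
        return -sum(abs(evaluate(tree, env) - target) for env, target in training_set)

    def evolve(training_set, pop_size=40, generations=60):
        population = [random_tree() for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=lambda t: fitness(t, training_set), reverse=True)
            survivors = population[: pop_size // 2]
            offspring = [crossover(random.choice(survivors), random.choice(survivors))
                         for _ in range(pop_size - len(survivors))]
            population = survivors + offspring
        return max(population, key=lambda t: fitness(t, training_set))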
-------------------------------------------------------------------------------

10. An indication of the general type of genetic or evolutionary computation used, such as GA (genetic algorithms), GP (genetic programming), ES (evolution strategies), EP (evolutionary programming), LCS (learning classifier systems), GI (genetic improvement), GE (grammatical evolution), GEP (gene expression programming), DE (differential evolution), etc.

GP (genetic programming)

-------------------------------------------------------------------------------

11. The date of publication of each paper.

05 January 2023

-------------------------------------------------------------------------------

References:

[1] Xiaoyuan Xie, Yi Song, Xihao Zhang and Quanming Liu. A ranking similarity-based modeling approach for software failures and computer-readable media. China Invention Patent: CN202211345882.6, 2023.
[2] Xiaoyuan Xie, Tsong Yueh Chen, Fei-Ching Kuo and Baowen Xu. A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization[J]. ACM Transactions on Software Engineering and Methodology, 2013, 22(4): 1-40.
[3] Shin Yoo. Evolving human competitive spectra-based fault localisation techniques[C]//Search Based Software Engineering: 4th International Symposium, 2012: 244-258.
[4] W. Eric Wong, Yu Qi, Lei Zhao and Kai-Yuan Cai. Effective fault localization using code coverage[C]//31st Annual International Computer Software and Applications Conference, 2007, 1: 449-456.
[5] W. Eric Wong, Vidroha Debroy and Dianxiang Xu. Towards better fault localization: A crosstab-based statistical approach[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2011, 42(3): 378-396.
[6] W. Eric Wong, Vidroha Debroy, Yihao Li and Ruizhi Gao. The DStar method for effective software fault localization[J]. IEEE Transactions on Reliability, 2013, 63(1): 290-308.
[7] Shin Yoo, Xiaoyuan Xie, Fei-Ching Kuo, Tsong Yueh Chen and Mark Harman. Human competitiveness of genetic programming in spectrum-based fault localisation: Theoretical and empirical analysis[J]. ACM Transactions on Software Engineering and Methodology, 2017, 26(1): 1-30.
[8] Lee Naish, Hua Jie Lee and Kotagiri Ramamohanarao. A model for spectra-based software diagnosis[J]. ACM Transactions on Software Engineering and Methodology, 2011, 20(3): 1-32.
[9] Yi Song, Xiaoyuan Xie, Xihao Zhang, Quanming Liu and Ruizhi Gao. "Artifact for the Paper 'Evolving Ranking-Based Failure Proximities for Better Clustering in Fault Isolation'". GitHub Repository, 2022. [Online]. Available: https://github.com/yisongy/SRR-GP. doi: https://doi.org/10.5281/zenodo.7034690.
[10] James A. Jones and Mary Jean Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique[C]//Proceedings of the 20th International Conference on Automated Software Engineering, 2005: 273-282.
[11] James A. Jones, Mary Jean Harrold and John T. Stasko. Visualization for fault localization[C]//Proceedings of the ICSE Workshop on Software Visualization, 2001: 71-75.
[12] Ruizhi Gao and W. Eric Wong. MSeer—An advanced technique for locating multiple bugs in parallel[J]. IEEE Transactions on Software Engineering, 2019, 45(3): 301-318.
[13] Abubakar Zakari and Sai Peck Lee. Parallel debugging: An investigative study[J]. Journal of Software: Evolution and Process, 2019, 31(11): e2178.
[14] Gabin An, Juyeon Yoon, Jeongju Sohn, Jingun Hong, Dongwon Hwang and Shin Yoo. Automatically identifying shared root causes of test breakages in SAP HANA[C]//Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 2022: 65-74.
[15] Chao Liu, Xiangyu Zhang and Jiawei Han. A systematic study of failure proximity[J]. IEEE Transactions on Software Engineering, 2008, 34(6): 826-843.
[16] Chao Liu, Xiangyu Zhang, Jiawei Han, Yu Zhang and Bharat K. Bhargava. Indexing noncrashing failures: A dynamic program slicing-based approach[C]//2007 IEEE International Conference on Software Maintenance, 2007: 455-464.
[17] Wolfgang Högerle, Friedrich Steimann and Marcus Frenkel. More debugging in parallel[C]//2014 IEEE 25th International Symposium on Software Reliability Engineering, 2014: 133-143.
[18] Yanqin Huang, Junhua Wu, Yang Feng, Zhenyu Chen and Zhihong Zhao. An empirical study on clustering for isolating bugs in fault localization[C]//2013 IEEE International Symposium on Software Reliability Engineering Workshops, 2013: 138-143.
[19] Friedrich Steimann and Marcus Frenkel. Improving coverage-based localization of multiple faults using algorithms from integer linear programming[C]//2012 IEEE 23rd International Symposium on Software Reliability Engineering, 2012: 121-130.
[20] Chao Liu, Xifeng Yan, Long Fei, Jiawei Han and Samuel P. Midkiff. SOBER: Statistical model-based bug localization[J]. ACM SIGSOFT Software Engineering Notes, 2005, 30(5): 286-295.
[21] Yi Song, Xiaoyuan Xie, Quanming Liu, Xihao Zhang and Xi Wu. A comprehensive empirical investigation on failure clustering in parallel debugging[J]. Journal of Systems and Software, 2022, 193: 111452.