OPENREVIEWER: MITIGATING CHALLENGES IN LLM REVIEWING (ICLR 2024 Submission)

Abstract

Human reviews are slow and of variable quality, so there is growing interest in using LLMs to review papers.

Main Challenges:

1. risk of misuse

2. inflated review scores

3. overconfident ratings

4. skewed score distributions

5. limited prompt length

Their method:

  1. marking LLM-generated reviews with an LLM watermark instead of relying on prompt engineering (a minimal watermarking sketch follows this list)
  2. classifying and detecting errors and shortcomings in papers
  3. using long-context windows that include the review form, the entire paper, reviewer guidelines, the code of ethics and conduct, area chair guidelines, and the previous year's statistics
  4. blind human evaluation of reviews.
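
The paper names watermarking as the mechanism, but my notes do not record the exact scheme. Below is a minimal sketch of a "green-list" style watermark in the spirit of Kirchenbauer et al.; the vocabulary size, the logit boost DELTA, and the hashing are my own toy choices, not details from the submission.

```python
# Toy "green-list" watermark sketch, NOT the authors' exact scheme.
import hashlib
import numpy as np

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5   # fraction of vocabulary marked "green" at each step
DELTA = 2.0            # logit boost applied to green tokens during generation

def green_list(prev_token_id: int) -> np.ndarray:
    """Deterministically derive the green-token mask from the previous token."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    mask[rng.choice(VOCAB_SIZE, int(GREEN_FRACTION * VOCAB_SIZE), replace=False)] = True
    return mask

def watermarked_sample(logits: np.ndarray, prev_token_id: int) -> int:
    """Boost green-token logits before sampling; the bias is later detectable."""
    biased = logits + DELTA * green_list(prev_token_id)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))

def detect(token_ids: list[int]) -> float:
    """z-score of the green-token rate; large values suggest a watermarked text."""
    hits = sum(green_list(prev)[cur] for prev, cur in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - GREEN_FRACTION * n) / np.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
```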

Methods

  1. Dataset: ICLR 2023, with 4,956 papers and 18,565 reviews.
  2. Pipeline: during inference, the author uploads a paper together with the conference name. OpenReviewer reviews the paper with GPT-4, using the previous year's conference statistics, the reviewer guidelines, the code of ethics, the code of conduct, the full paper text, and the review form. The generated review is watermarked and returned to the author, who then provides feedback about the review.
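
A rough sketch of how that inference pipeline could look in code. The helper names, file paths, and prompt wording are my assumptions; only the list of context ingredients comes from the paper, and watermark_text is a stand-in for the watermarking step sketched above.

```python
# Hypothetical sketch of the OpenReviewer inference pipeline; helper names and
# file paths are assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def load_text(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

def build_context(conference: str, paper_text: str) -> str:
    """Concatenate everything the long-context prompt is said to include."""
    parts = [
        load_text(f"{conference}/previous_year_statistics.txt"),
        load_text(f"{conference}/reviewer_guidelines.txt"),
        load_text(f"{conference}/code_of_ethics.txt"),
        load_text(f"{conference}/code_of_conduct.txt"),
        load_text(f"{conference}/review_form.txt"),
        paper_text,
    ]
    return "\n\n".join(parts)

def watermark_text(review: str) -> str:
    # Placeholder: the real system applies an LLM watermark during generation.
    return review

def review_paper(conference: str, paper_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a careful, constructive paper reviewer."},
            {"role": "user", "content": build_context(conference, paper_text)},
        ],
    )
    return watermark_text(response.choices[0].message.content)
```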

Evaluation

  1. randomly sample 10% of the papers
  2. randomly permute the reviews for each sampled paper, which consist of three human reviews and three LLM reviews
  3. have human experts evaluate the reviews blindly: the experts do not know whether the review being evaluated was written by a human or an LLM, and each expert answers four meta-questions about each review.
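
A small sketch of how the blind-evaluation batches could be assembled; the data layout (dicts with human_reviews / llm_reviews fields) and the seed are my own assumptions.

```python
# Sketch of the blind-evaluation setup: sample 10% of papers, mix and shuffle
# the human and LLM reviews per paper, and hide the authorship labels.
import random

def build_blind_batches(papers, sample_frac=0.10, seed=0):
    rng = random.Random(seed)
    sampled = rng.sample(papers, k=max(1, int(sample_frac * len(papers))))
    batches = []
    for paper in sampled:
        reviews = [(r, "human") for r in paper["human_reviews"]] + \
                  [(r, "llm") for r in paper["llm_reviews"]]
        rng.shuffle(reviews)  # experts see the six reviews in random order
        batches.append({
            "paper_id": paper["id"],
            "reviews": [text for text, _ in reviews],
            "hidden_labels": [label for _, label in reviews],  # revealed after scoring
        })
    return batches
```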

Prompt Samples

Summary: Summarize what the paper claims to contribute. Be positive and constructive.
Correctness: Please classify the paper on the following scale to indicate the correctness of the technical claims, experimental and research methodology and on whether the central claims of the paper are adequately supported with evidence. 4: All of the claims and statements are well-supported and correct. 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct. 2: Several of the paper’s claims are incorrect or not well-supported. 1: The main claims of the paper are incorrect or not at all supported by theory or empirical results.
Strengths and weaknesses: List strong and weak points of the paper. Be as comprehensive as possible.

Clarity, Quality, Novelty, and Reproducibility: Please evaluate the clarity, quality, novelty, and reproducibility of the paper.

Summary of the review: Clearly state your initial recommendation (accept or reject) with one or two key reasons for this choice. Provide supporting arguments for your recommendation.
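
The form fields above can be packed into a single structured prompt. The field wording below is taken from my notes; the dictionary layout and the rendering function are assumptions.

```python
# Review-form fields expressed as a structured prompt template (assumed layout).
REVIEW_FORM = {
    "Summary": "Summarize what the paper claims to contribute. Be positive and constructive.",
    "Correctness": (
        "Classify the paper on a 1-4 scale for the correctness of the technical claims "
        "and methodology (4 = all claims well-supported and correct, 1 = main claims "
        "incorrect or not at all supported)."
    ),
    "Strengths and weaknesses": "List strong and weak points of the paper. Be as comprehensive as possible.",
    "Clarity, Quality, Novelty, and Reproducibility": "Evaluate the clarity, quality, novelty, and reproducibility of the paper.",
    "Summary of the review": (
        "Clearly state your initial recommendation (accept or reject) with one or two "
        "key reasons, and provide supporting arguments."
    ),
}

def render_review_form_prompt() -> str:
    lines = ["Fill in every field of the review form below."]
    for field, instruction in REVIEW_FORM.items():
        lines.append(f"\n{field}:\n{instruction}")
    return "\n".join(lines)
```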

Some of my thoughts about reviewGPT

Prompt: From the review, list the strengths and weaknesses along the following dimensions: clarity, quality, novelty, and reproducibility.
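
One way this prompt could be realized, asking for structured output so the per-dimension lists are machine-readable; the JSON schema is my own assumption.

```python
# Hypothetical extraction prompt; the JSON schema is an assumption.
EXTRACTION_PROMPT = """From the review below, list the strengths and weaknesses
along each of these dimensions: clarity, quality, novelty, reproducibility.
Answer as JSON: {{"clarity": {{"strengths": [...], "weaknesses": [...]}}, ...}}

Review:
{review}
"""

def build_extraction_prompt(review_text: str) -> str:
    return EXTRACTION_PROMPT.format(review=review_text)
```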

Prompt: Cluster all of last year's reviews and their scores, fit a model on them, and then use that model to predict, for a paper submitted this year, what its review would look like, what score it would receive, and whether it would be accepted.
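
A sketch of the clustering and score/acceptance half of this idea on last year's review texts; generating the predicted review text itself would need an LLM and is not shown. The features, models, and field names are my own choices.

```python
# Assumed sketch: TF-IDF features over last year's reviews, k-means clustering,
# plus simple models mapping review text to score and acceptance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression, Ridge

def fit_last_year_models(review_texts, scores, accepted, n_clusters=8):
    vec = TfidfVectorizer(max_features=20_000, stop_words="english")
    X = vec.fit_transform(review_texts)

    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    score_model = Ridge().fit(X, scores)                    # predicts the numeric rating
    accept_model = LogisticRegression(max_iter=1000).fit(X, accepted)
    return vec, clusters, score_model, accept_model

def predict_this_year(vec, clusters, score_model, accept_model, review_text):
    x = vec.transform([review_text])
    return {
        "cluster": int(clusters.predict(x)[0]),
        "predicted_score": float(score_model.predict(x)[0]),
        "p_accept": float(accept_model.predict_proba(x)[0, 1]),
    }
```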
