Category Archives: AI

Could LLMs like ChatGPT ever replace part of the academic peer-review process?

Recently, I made a comment over on Steve Heard’s Scientist Sees Squirrel blog:

I have never published a paper that’s not been improved, to some degree, by peer review, and broadly the system works. But I do wonder if it’s sustainable in the long-term and whether in the future LLMs might actually be a more effective way of assessing manuscripts. I recognise that’s (currently) a controversial statement to make – but having recently run a few of my own manuscripts through ChatGPT and asked for its “opinion”, I can honestly say that the feedback has improved not just the writing but also the framing and focus of the work. It’s also picked up weaknesses and errors that I had otherwise missed.

That initiated an email conversation with Steve which resulted in me running a short experiment with ChatGPT model 5.5. I first loaded up the original manuscript of this paper on pollinator effectiveness, as I'd submitted it to the journal. I then asked ChatGPT to write a review of the manuscript as though it were a peer reviewer for the journal. Which it did – in some detail – in 28 seconds! If anyone is interested I can send them that review, but it's the next bit that I think is especially interesting.

After ChatGPT had completed the review, I then uploaded the actual peer reviews I’d received from the journal, plus the editor’s comments, and asked it to summarise the degree to which its review agreed with those I had received.

Below I have copied the response, but it's the take-home message, summarised in the graphic above, which I think is the most intriguing aspect: there was 55% agreement with Reviewer 1, 70% agreement with Reviewer 2, and 65% agreement between ChatGPT's assessment of the manuscript and that of the Editor.

That may not sound like a large overlap, but note that the level of agreement between the two reviewers themselves was only 65%! In my not inconsiderable experience (>130 peer-reviewed papers and chapters, and hundreds of peer reviews of my own) this is fairly typical – reviewers ALWAYS pick up on different strengths and weaknesses in a manuscript!

What particularly impressed me about ChatGPT's response below is that it went far beyond simply looking at the language being used: it provided a critique of how the study was conducted, how the data were analysed, and the conclusions drawn. Which is just what we expect a human reviewer to do! Not only that, but it was more critical of the work than the reviewers and the editor.

So where does this leave us? Could Large Language Models (LLMs) like ChatGPT replace part of the academic peer-review process? Clearly there need to be guide rails and human agency involved in final decisions, but I think that, yes, they may well have a role to play in the not-too-distant future. As always, I’d be interested in your comments.

If you want to dive into this more deeply, here’s the verbatim account of what ChatGPT wrote (including its emboldened parts) when I asked how much agreement there was between its assessment and those of the other two reviewers and the editor:

There is a fair amount of overlap, but not a perfect match.

I would say my review was moderately similar in substance, but noticeably more critical in tone and in editorial recommendation. The strongest areas of convergence were on the paper’s core strengths and on one key statistical concern. The biggest mismatch is that the actual editor and reviewers were much more positive overall, and most of their requested changes are closer to clarification, presentation, and framing than to a deep challenge to the paper’s publishability.

The clearest points of agreement are these. First, I highlighted the manuscript’s strengths as the multi-year design, the focus on different functional groups, and the interesting role of Lepidoptera; that aligns very closely with the guest editor, who explicitly says that “the roles of lepidopterans, the inclusion of flight distance measures, and the multi-year study are particularly notable,” and with Reviewer 2, who lists the multi-year field study, multiple pollinator groups, and the unexpected importance of Lepidoptera as strengths. Second, I flagged the Poisson treatment of an ordinal flight-distance variable as statistically weak; Reviewer 1 independently makes essentially the same point and explicitly recommends ordinal regression instead. Third, I thought the discussion was somewhat over-extended and not always tightly aligned with the evidence; Reviewer 2 likewise suggests embedding the work better in the broader literature and moving some material out of the conclusion into the discussion.

There is also a partial overlap around the manuscript’s central interpretation. I argued that the paper sometimes pushes too hard from pollen deposition and coarse movement classes to larger claims about true generalisation, outcrossing value, and broader evolutionary or environmental buffering. The editor does not object to those broad claims in the same way; instead, the editor actually asks for the discussion to connect the results more explicitly to the theoretical statement that plants should generalise on equally effective pollinators under dynamic environments, and even invites discussion of anthropogenic disturbance and future interactions. In other words, where I wanted that interpretive arc toned down, the editor mostly wanted it better developed and better supported.

The main differences are quite important.

First, I was more sceptical about inference than the real reviewers were. I was worried that the manuscript moved too quickly from non-significant or coarse flight-distance patterns to statements about butterfly superiority, reduced geitonogamy, and broader adaptive significance. Reviewer 1 only objected narrowly to the modelling choice; neither reviewer makes my broader inferential critique, and the editor seems broadly comfortable with the interpretive direction so long as it is supported with more descriptive data and clearer exposition.

Second, I was more concerned about the coarse pooling of visitor taxa. I thought the manuscript should be more cautious because “butterflies,” “hoverflies,” “bumblebees,” and especially “Others” are broad bins. The actual comments do touch this, but in a softer, more taxonomic way: the editor asks whether there is a species or morphotype list, and Reviewer 2 asks for more detail on the butterfly taxa and justification for putting solitary bees into “Others.” That is related to my concern, but it is narrower and more presentational than the conceptual objection I raised.

Third, I did not anticipate how much of the real feedback would focus on descriptive content and figures. The editor repeatedly asks for more floral biology data, clearer results presentation, more absolute numbers, more behavioural quantification, clearer site description, and better figure design and consistency. Reviewer 2 similarly asks for effect sizes, better figures, units, and clearer display of the underlying data distribution. I mentioned presentational issues only in a general way; the actual reports are much more detailed and concrete here.

Fourth, I did not raise the absence of random effects, whereas Reviewer 2 does. That is probably the single most important substantive point that I missed. Reviewer 2 specifically asks why insect individual and/or observation day were not modelled as random effects.

Fifth, my editorial recommendation was harsher. I recommended major revision. The actual reports read much more like minor-to-moderate revision. The editor calls it “an excellent piece of work” and says it is well executed with robust methodology; Reviewer 2 says they have “little to criticise,” mostly editorial suggestions plus a few statistical and figure issues; Reviewer 1 has only a short list of points. That collective tone is clearly more encouraging than mine.

So, in condensed form:

  • Where I matched well: multi-year dataset as a major strength; Lepidoptera result as a notable contribution; the Poisson-for-ordinal issue; some concern that discussion/framing needed tightening.
  • Where I was too severe: I treated the interpretive overreach as a major obstacle, whereas the real reviewers mostly saw a strong paper needing clarification and enrichment rather than retrenchment.
  • What I underplayed or missed: requests for more descriptive natural history and floral biology data, much more detailed figure/presentation criticism, and the random-effects issue.

My bottom-line judgement is that my review was directionally right on some key scientific points, especially the statistical issue and the general strengths, but too pessimistic about the paper’s editorial position. If I had seen these reports first, I would have shifted from “major revision because the evidence-conclusion fit is shaky” to something more like “moderate revision: good paper, but it needs clearer descriptive support, better figure presentation, and tighter statistical justification. If revised along those lines, I think the manuscript could become a credible and interesting contribution”.

Here is a concise version of the editorial decision in journal-review language:

This manuscript addresses an interesting and worthwhile question using a useful multi-year dataset and a stronger-than-usual measure of pollinator performance. However, the current version over-interprets some of the results, particularly those concerning movement behaviour, outcrossing implications, and the broader adaptive significance of generalised pollination. The statistical treatment of flight distance is also not fully convincing. I therefore recommend major revision. The manuscript has clear potential, but its conclusions need to be more tightly aligned with what the data actually demonstrate.
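
A footnote for the statistically minded: the two most substantive analytical points raised above – Reviewer 1's (and ChatGPT's) suggestion to use ordinal rather than Poisson regression for the flight-distance classes, and Reviewer 2's question about random effects – are straightforward to express in R. The lines below are purely an illustrative sketch with made-up variable and data-frame names (they are not taken from the paper's actual analysis):

library(ordinal)  # cumulative link (ordinal) regression
library(lme4)     # generalised linear mixed models

# Flight distance was recorded as ordered classes, so treat it as an
# ordinal response rather than a Poisson count (hypothetical data frame 'visits'):
visits$flight_class <- factor(visits$flight_class, ordered = TRUE)
m_flight <- clm(flight_class ~ pollinator_group, data = visits)

# Pollen deposition counts modelled with insect individual and observation
# day as random effects, along the lines Reviewer 2 asked about:
m_pollen <- glmer(pollen_grains ~ pollinator_group +
                    (1 | insect_id) + (1 | obs_day),
                  family = poisson, data = visits)

summary(m_flight)
summary(m_pollen)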

Monkey business at the SCAPE conference

I’d like to tell you about a dream that I had last night. As far as I can recall this is the first time I’ve mentioned my dreams in about 14 years of regular blogging. I dream almost every night: vivid, highly immersive, realistic, often weird, sometimes scary, frequently funny dreams that, each morning, I can usually recall in some detail. That’s always been the case, ever since I was a small child.

If I drink beer or eat blue cheese my dreams become stronger and weirder, presumably because of the chemicals produced by the hops (which belong to the same plant family as cannabis) or the fungi. But regardless of what I ingest, I’m almost guaranteed to dream.

Last night’s dream involved the SCAPE meeting. In the dream, some colleagues had submitted the abstract for a talk at the conference and, when I checked it, I discovered that it was full of spelling and grammatical errors. So I did what I normally do – I started to revise the document. Suddenly, I found myself at the said conference and my colleagues were telling me not to change anything because it had been written by the first author – a gorilla* called Merrill.

I started to argue that, regardless of our semi-literate co-author, we ought to make some corrections, when Merrill looked at me with his big, dark, doleful eyes. So I reached over and scratched his head, which he seemed to enjoy. I can still recall the texture of his short, wiry hair under my fingers, because at that exact moment I woke up thinking…..WTF?!

Dreams such as this often have some basis in things I’ve seen or read about or done, so I spent the morning thinking about what could have prompted it. And I believe that I know what it was. There’s been a lot of discussion recently about scientific paper authorship and responsibility – not least in the context of AI – and I’ve seen stories about research papers with non-humans, such as pets, as co-authors. So was my brain sublimating these ideas into a fantasy about having a gorilla as a co-author? Who knows. It was an amusing way to spend my sleeping hours, though.

*Before anyone comments that “gorillas are apes, they’re not monkeys, the title of your post is incorrect”, I’d like to point out that, phylogenetically speaking, apes (including ourselves) ARE monkeys in the sense that they (we) are nested within that larger grouping of primates.

Join me on 26th February in Leicester for a talk: “Adventures in Pollination!”

On Thursday 26th February I’m giving a talk to the Friends of the University of Leicester Botanic Garden with the title that you see above.

The talk starts at 7.30pm and non-members are welcome to attend, for a donation of £2 (which sounds like a bargain to me!). I'll also have copies of my books Pollinators & Pollination: Nature and Society and Birds & Flowers: An Intimate 50 Million Year Relationship for sale.

More details can be found here:

https://le.ac.uk/botanic-garden/friends-of-the-garden

I hope to see some of you there!

As you might have guessed, the image above was generated by my personal "tool with collaborator-like affordances" – blame it for any biogeographical errors!

Should AI be thought of as a science tool or collaborator? A conversation with ChatGPT

In a piece that I posted before Christmas, I made the offhand comment that "we need to think of AI as a collaborator as much as a tool". It was offhand in the sense that I'd not thought too deeply about it, but some reading that I did over the holiday made me rethink this statement, or at least consider it more critically. The most recent post on the Dynamic Ecology blog – Meghan Duffy asking about using AI to find literature – also got me thinking about whether AI is just a tool.

Eventually I decided to ask the question of ChatGPT itself and we ended up having quite a long conversation on this topic. I then asked it to summarise the conversation and what you see below is my edited version of that summary:

Jeff: Is ChatGPT best thought of as a tool or a collaborator in science?
ChatGPT: ChatGPT is best understood as a tool with collaborator-like affordances. It supports analysis, writing, coding, and idea development, but it does not participate in collaboration as a social, epistemic, and accountable practice. It is also prone to errors.

Jeff: But human collaborators also make errors and sometimes avoid accountability. We do not expect people to be flawless.
ChatGPT: Exactly. The issue is not error versus error-free performance. It is about agency and accountability: human collaborators can explain their reasoning, revise their views, and be held responsible for their contributions.

Jeff: You said that ChatGPT does not participate in collaboration as a “social, epistemic, and accountable practice”. Yet you can still generate genuinely useful ideas and connections.
ChatGPT: Yes, and that is why calling ChatGPT part of the infrastructure (as I did in our longer conversation) is misleading. Unlike laptops or software platforms, I can actively generate ideas, cross-fertilise between disparate areas, and help explore conceptual space.

Jeff: If I ask a colleague to write R code, that is collaboration. If I ask you to write R code, why is that different?
ChatGPT: Because a human colleague brings intent, methodological judgement, and ownership of decisions. I generate code instrumentally, without understanding the scientific stakes or assuming responsibility for analytical choices.

Jeff: Earlier you said you cannot “stand behind a claim”. But you can cite papers and explain why something should be trusted. Is that really true?
ChatGPT: When I cite papers or summarise arguments, I reproduce the form of justification, but I still cannot stand behind a claim in the scientific sense. I do not judge evidential strength or accept responsibility; trust remains with the human scientist and the underlying sources.

Jeff: Yet the output of your code or analysis can be just as useful as that produced by a human collaborator.
ChatGPT: That is true. The distinction is not about usefulness or correctness, but about agency, authorship, and accountability within scientific practice.

Jeff: So what is ChatGPT, if it is neither a standard tool nor a collaborator?
ChatGPT: It is best described as a cognitive amplifier or dialogic system: something that expands the range of ideas and connections available to researchers, without itself being an accountable scientific agent.

Jeff: And what does that imply for how we should use ChatGPT in science?
ChatGPT: It implies using ChatGPT to provoke, explore, and accelerate thinking, while recognising that scientific judgement, responsibility, and ownership remain entirely human.

Jeff: Produce an image that I can use on the blog which summarises this discussion.
ChatGPT: Certainly.

As a test of its ability to "provoke, explore, and accelerate thinking", at the end of last week I uploaded a manuscript that I had written, just prior to submission, and asked ChatGPT to tell me its strengths and weaknesses and its fit for the journal. Its feedback was spot-on – far more insightful than I would expect from a mere "tool" – and I made a few final tweaks before I submitted the manuscript. ChatGPT also said "Don't be surprised if the reviewers ask you to change X, Y and Z", so it will be interesting to see whether it has correctly gauged the manuscript and anticipated the reviewers' responses.

What times we live in! At the moment I'm optimistic enough about AI to see all of this as an intriguing exploration of the capabilities of these large language models: an expedition through dense habitat in which we've barely left base camp, our view of what lies ahead is restricted, and there may be nasty surprises along any path that we hack. But I appreciate that not everyone is so optimistic and, as always, I'd be interested in your thoughts on this topic – leave a comment or send me a message.