
Could LLMs like ChatGPT ever replace part of the academic peer-review process?

Recently, I made a comment over on Steve Heard’s Scientist Sees Squirrel blog:

I have never published a paper that’s not been improved, to some degree, by peer review, and broadly the system works. But I do wonder if it’s sustainable in the long-term and whether in the future LLMs might actually be a more effective way of assessing manuscripts. I recognise that’s (currently) a controversial statement to make – but having recently run a few of my own manuscripts through ChatGPT and asked for its “opinion”, I can honestly say that the feedback has improved not just the writing but also the framing and focus of the work. It’s also picked up weaknesses and errors that I had otherwise missed.

That initiated an email conversation with Steve which resulted in me running a short experiment with ChatGPT model 5.5. I first uploaded the original manuscript of this paper on pollinator effectiveness, as I had submitted it to the journal. I then asked ChatGPT to write a review of the manuscript as though it were a peer reviewer for the journal. Which it did – in some detail – in 28 seconds! If anyone is interested I can send them that review, but it’s the next bit that I think is especially interesting.

After ChatGPT had completed the review, I then uploaded the actual peer reviews I’d received from the journal, plus the editor’s comments, and asked it to summarise the degree to which its review agreed with those I had received.

Below I have copied the response, but it’s the take-home message, summarised in the graphic above, which I think is the most intriguing aspect: there was a 55% agreement with Reviewer 1, a 70% agreement with Reviewer 2, and a 65% overall agreement between ChatGPT’s assessment of the manuscript and that of the Editor.

That may not sound like a large overlap, but note that the level of agreement between the two reviewers themselves was only 65%! In my not inconsiderable experience (>130 peer-reviewed papers and chapters, hundreds of peer reviews of my own) this is fairly typical – reviewers ALWAYS pick up on different strengths and weaknesses in a manuscript!

What particularly impressed me about ChatGPT’s response below is that it went far beyond simply looking at the language used: it provided a critique of how the study was conducted, how the data were analysed, and the conclusions drawn. Which is just what we expect a human reviewer to do! Not only that, but it was more critical of the work than the reviewers and the editor.

So where does this leave us? Could Large Language Models (LLMs) like ChatGPT replace part of the academic peer-review process? Clearly there need to be guide rails and human agency involved in final decisions, but I think that, yes, they may well have a role to play in the not-too-distant future. As always, I’d be interested in your comments.

If you want to dive into this more deeply, here’s the verbatim account of what ChatGPT wrote (including its emboldened parts) when I asked how much agreement there was between its assessment and those of the other two reviewers and the editor:

There is a fair amount of overlap, but not a perfect match.

I would say my review was moderately similar in substance, but noticeably more critical in tone and in editorial recommendation. The strongest areas of convergence were on the paper’s core strengths and on one key statistical concern. The biggest mismatch is that the actual editor and reviewers were much more positive overall, and most of their requested changes are closer to clarification, presentation, and framing than to a deep challenge to the paper’s publishability.

The clearest points of agreement are these. First, I highlighted the manuscript’s strengths as the multi-year design, the focus on different functional groups, and the interesting role of Lepidoptera; that aligns very closely with the guest editor, who explicitly says that “the roles of lepidopterans, the inclusion of flight distance measures, and the multi-year study are particularly notable,” and with Reviewer 2, who lists the multi-year field study, multiple pollinator groups, and the unexpected importance of Lepidoptera as strengths. Second, I flagged the Poisson treatment of an ordinal flight-distance variable as statistically weak; Reviewer 1 independently makes essentially the same point and explicitly recommends ordinal regression instead. Third, I thought the discussion was somewhat over-extended and not always tightly aligned with the evidence; Reviewer 2 likewise suggests embedding the work better in the broader literature and moving some material out of the conclusion into the discussion.

There is also a partial overlap around the manuscript’s central interpretation. I argued that the paper sometimes pushes too hard from pollen deposition and coarse movement classes to larger claims about true generalisation, outcrossing value, and broader evolutionary or environmental buffering. The editor does not object to those broad claims in the same way; instead, the editor actually asks for the discussion to connect the results more explicitly to the theoretical statement that plants should generalise on equally effective pollinators under dynamic environments, and even invites discussion of anthropogenic disturbance and future interactions. In other words, where I wanted that interpretive arc toned down, the editor mostly wanted it better developed and better supported.

The main differences are quite important.

First, I was more sceptical about inference than the real reviewers were. I was worried that the manuscript moved too quickly from non-significant or coarse flight-distance patterns to statements about butterfly superiority, reduced geitonogamy, and broader adaptive significance. Reviewer 1 only objected narrowly to the modelling choice; neither reviewer makes my broader inferential critique, and the editor seems broadly comfortable with the interpretive direction so long as it is supported with more descriptive data and clearer exposition.

Second, I was more concerned about the coarse pooling of visitor taxa. I thought the manuscript should be more cautious because “butterflies,” “hoverflies,” “bumblebees,” and especially “Others” are broad bins. The actual comments do touch this, but in a softer, more taxonomic way: the editor asks whether there is a species or morphotype list, and Reviewer 2 asks for more detail on the butterfly taxa and justification for putting solitary bees into “Others.” That is related to my concern, but it is narrower and more presentational than the conceptual objection I raised.

Third, I did not anticipate how much of the real feedback would focus on descriptive content and figures. The editor repeatedly asks for more floral biology data, clearer results presentation, more absolute numbers, more behavioural quantification, clearer site description, and better figure design and consistency. Reviewer 2 similarly asks for effect sizes, better figures, units, and clearer display of the underlying data distribution. I mentioned presentational issues only in a general way; the actual reports are much more detailed and concrete here.

Fourth, I did not raise the absence of random effects, whereas Reviewer 2 does. That is probably the single most important substantive point that I missed. Reviewer 2 specifically asks why insect individual and/or observation day were not modelled as random effects.

Fifth, my editorial recommendation was harsher. I recommended major revision. The actual reports read much more like minor-to-moderate revision. The editor calls it “an excellent piece of work” and says it is well executed with robust methodology; Reviewer 2 says they have “little to criticise,” mostly editorial suggestions plus a few statistical and figure issues; Reviewer 1 has only a short list of points. That collective tone is clearly more encouraging than mine.

So, in condensed form:

  • Where I matched well: multi-year dataset as a major strength; Lepidoptera result as a notable contribution; the Poisson-for-ordinal issue; some concern that discussion/framing needed tightening.
  • Where I was too severe: I treated the interpretive overreach as a major obstacle, whereas the real reviewers mostly saw a strong paper needing clarification and enrichment rather than retrenchment.
  • What I underplayed or missed: requests for more descriptive natural history and floral biology data, much more detailed figure/presentation criticism, and the random-effects issue.

My bottom-line judgement is that my review was directionally right on some key scientific points, especially the statistical issue and the general strengths, but too pessimistic about the paper’s editorial position. If I had seen these reports first, I would have shifted from “major revision because the evidence-conclusion fit is shaky” to something more like “moderate revision: good paper, but it needs clearer descriptive support, better figure presentation, and tighter statistical justification. If revised along those lines, I think the manuscript could become a credible and interesting contribution”.

Here is a concise version of the editorial decision in journal-review language:

This manuscript addresses an interesting and worthwhile question using a useful multi-year dataset and a stronger-than-usual measure of pollinator performance. However, the current version over-interprets some of the results, particularly those concerning movement behaviour, outcrossing implications, and the broader adaptive significance of generalised pollination. The statistical treatment of flight distance is also not fully convincing. I therefore recommend major revision. The manuscript has clear potential, but its conclusions need to be more tightly aligned with what the data actually demonstrate.

AI at the crossroads: can ChatGPT turn you into a statistical Robert Johnson?

When it comes to the statistical analysis of data, I know my limits. Maths was never my strong point at school or university, and my approach has always been to keep analyses as simple and straightforward as possible*, or to rely on colleagues with fancier statistical chops to do the heavy lifting. I wish that were not the case – I wish I had a brain that was more number-focused than it is. But I don’t and I’ve learned to live with it, to play to my actual strengths as a scientist, and to collaborate with others who can bring different skills to the party.

In theory, the development of the R platform was supposed to make life easier for those of us who wanted to analyse complex data sets. But actually having to script, from scratch, the code to carry out even simple analyses always seemed to me to be a step backwards from the push-button days of SPSS or Minitab. Yes, I get that R is incredibly powerful and flexible and blah blah. But it still involves a heavy time commitment and an aptitude for writing code that many of us struggle with.

Recently, however, things have changed. I find myself carrying out complex statistical analyses that would have stumped me 12 months ago. Not only that, but I now understand those tests on a much deeper level than I ever did before. I also feel much more confident in the interpretation of the outputs from the tests I’m running, and their limitations.

Why the overnight change? ChatGPT.

More precisely, I’m using ChatGPT to help me decide which analytical approaches are best suited to the data that I have, to help me write the R scripts that carry out the tests, and then (crucially) to advise me on the interpretation of the statistical output and to suggest next steps.

Let me give you an example. I’ve just submitted a manuscript to a journal describing the results from an experiment that had confounded me for years, which is why I’d not published the work previously. Following discussions with some colleagues in China I realised that my framing of the work was wrong (by coincidence, a topic that Jeremy Fox has recently discussed over on the Dynamic Ecology blog). However, there was still a contradiction between two sets of results that I could not resolve: they should have been telling me the same thing but they were not. When I queried ChatGPT on this, it suggested that I model the data taking into account the fact that I had missing data – missingness in statistical jargon. When I did – bingo! – the results made sense: the absence of some data in my experimental treatments had systematically biased the results.
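To make that point about missingness a bit more concrete, here is a minimal, purely illustrative sketch in R. The data and variable names are simulated and hypothetical (this is not my actual experiment or analysis); it simply shows how missingness that depends on the treatment can bias a naive complete-case comparison, and how modelling the missingness itself flags the problem:

```r
# Purely illustrative, simulated example (hypothetical variable names, not my
# actual dataset): missingness that depends on the treatment can bias a naive
# complete-case comparison.
set.seed(42)
n         <- 500
treatment <- rep(c("A", "B"), each = n / 2)
outcome   <- rnorm(n, mean = 10, sd = 2)      # true means identical in A and B

# Suppose low values in treatment B are much more likely to go unrecorded
p_missing <- ifelse(treatment == "B" & outcome < 10, 0.7, 0.05)
observed  <- rbinom(n, 1, 1 - p_missing) == 1

dat <- data.frame(treatment, outcome, observed)
dat$missing <- !dat$observed

# Naive complete-case analysis: suggests B > A even though the true means are
# equal, because the low values in B have silently dropped out
summary(lm(outcome ~ treatment, data = subset(dat, observed)))

# Modelling the missingness itself flags the problem: missingness is strongly
# associated with treatment
summary(glm(missing ~ treatment, family = binomial, data = dat))
```

In real data, of course, the right fix depends on the missingness mechanism, which is exactly the kind of question worth putting to ChatGPT – or a statistician – before trusting the output.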

Now, I could have talked this over with a statistician or a more statistically-minded ecologist colleague. But scientists are busy people and I did not want to impose on someone’s limited time. Or rather, on multiple someones’ limited time, because I know from past experience that when you ask folks these sorts of questions you can get different advice depending upon their own experiences, training, or preferred flavours of statistical analysis. By treating ChatGPT as a collaborator I can get an objective answer to my data questions, written in a way that I can understand. That last point is key, because for all of us with specific expertise it’s sometimes difficult to translate our knowledge into broadly interpretable language.

How can I know that ChatGPT is giving me reliable statistical advice? It certainly didn’t give me accurate information about Erasmus Darwin a couple of years ago (a story, incidentally, that I included in my recent book Birds & Flowers: An Intimate 50 Million Year Relationship). But since then, the reliability and accuracy of ChatGPT have improved considerably, and when I’ve checked the information it’s given me about analyses, it has usually been accurate as far as I can gauge. In one case, however, it completely missed the point of what I was trying to do with another set of data. But of course advice from human collaborators can also be inaccurate – everyone is fallible. So including human (my!) oversight in all of this is important.

I’m certainly not the only one using ChatGPT and other AI platforms in this way – here’s a small sample of some online articles I’ve found on the topic:

I especially like this quote from that last article:

“If I hired a consultant to write the code when I told them what I needed, would that be a problem? Then, what’s the problem in doing stats with an (AI) consultant?”

I can only agree, and again, I emphasise that we need to think of AI as a collaborator as much as a tool.

By now you might be wondering what any of this has got to do with blues musician Robert Johnson**. Well, according to legend, Johnson was initially a fairly mediocre guitar player. His overnight transformation into a brilliant and highly influential guitarist was attributed to him meeting the Devil at a local crossroads and trading his soul for the mastery of his instrument.

Of course the story is nonsense, and Johnson clearly worked hard at his craft, but it feels like a nice metaphor for where we are at the moment: at a crossroads when it comes to how we use AI, with the opportunity for it to turn us, overnight, into experts. Will that come at a profound cost? Are we selling our souls to some kind of digital Devil? AI naysayers claim it will and we are. But as with all technological advancements, from the wheel to the internet, there are both good and bad aspects. Only time will tell whether the former outweigh the latter. In the meantime, let’s try to discover how best to use this amazing technology for our own and society’s benefit.

——————————————————————————————————————-

*In fact one of my recent data papers involved no statistical analysis at all because the results were so clear cut: to paraphrase the abstract, when East African lowland honey bees encountered other insects on flower heads, those other insects were displaced 100% of the time. That’s not a result that requires any kind of test.

**Blues is one of my favourite genres of music and this is not the first time it’s featured on the blog. When Karin and I returned to the UK in March 2024 one of the first things that I got involved with was helping to kick-start a regular blues evening featuring professional British and international players. The not-for-profit initiative is called Harborough Blues – check out the website for upcoming gigs.

Image credit: ChatGPT of course, under instruction from me

How reliable is ChatGPT? A weird encounter with Erasmus Darwin

There’s currently a lot of interest in ChatGPT and over the past week, like many others, I have been playing around with this artificial intelligence (AI) system and testing its knowledge and creative powers. The latter are really very impressive: I asked it to write a couple of limericks about Karin’s and my professions, and about our move to Denmark, and it produced a decent stab at those.

I also probed its knowledge about topics with which I’m familiar, mainly pollinators and pollination of course, and it produced some good answers. When I highlighted some errors it had made, ChatGPT went back and revised its answers, and politely thanked me in the process. If I were a student I would not rely on it to write accurate essays, but so far so good(ish).

This morning, however, I had a very odd experience. I was trying to track down something attributed to Erasmus Darwin (Charles’s grandfather) and I asked ChatGPT: “What did Erasmus Darwin say about birds visiting flowers?”

This is the answer that came back:

Erasmus Darwin was a naturalist, physician, and poet who wrote about a wide range of topics, including birds and flowers. In his 1794 poem “The Loves of the Plants,” he describes how different species of birds visit flowers in search of nectar:

“Birds of each hue, from the gay parrot’s beak
To the goldfinch, painted in tiny streak,
Sip nectar from the cups that bloom and bow
With every breeze, and nod and dance below.”

Now, the first part of this, about Erasmus’s life, is true. It’s also true that he wrote a poem with that title, though it was published in 1791, not 1794.

But, when I double-checked those attributed lines against the published version of the poem that’s on Project Gutenberg, I discovered that they did not exist. ChatGPT had made them up!

As I said, this AI is pretty creative and will certainly write a poem for you on any topic you care to request. But I didn’t ask it for that; I had made a factual query. And ChatGPT had just invented some poetry and claimed it as fact, unless of course those really are Erasmus’s lines and I’ve missed their source (please correct me if I have).

Now as a scientist and writer I try to be rigorous when quoting and citing sources, and if I hadn’t double-checked this I might have taken it at face value. And one could argue that my initial question – “What did Erasmus Darwin say…?” – was a little ambiguous. But nonetheless, I was not expecting an AI to try to pass off something it had created as fact.

Has anyone else experienced anything like this when using ChatGPT? It’s a fun, creative tool. But as far as I can tell, it also has a very human ability to lie.

Generating AI art from titles of scientific publications

WARNING: huge time wasting potential ahead.

As regular readers of my blog might know, I’m a sucker for computer-generated “stuff”, for example virtual ecological systems; see my 2020 post “a simple online ecosystem model: like Tamagotchi for the green generation“. Last night, while browsing Twitter, I came across a few people tweeting about app.wombo.art, which uses words and phrases as a prompt for its AI to generate art in a variety of styles. For example, the image above is based on the title of my book Pollinators & Pollination: Nature and Society. The downloaded image always has “dream” at the top, which is easy enough to crop, while the “PROMPT” is the word or phrase that you entered, and can be turned off.

You can also use the titles of scientific articles – this one is my 1996 paper “Generalization in Pollination Systems and Why it Matters” (I don’t think that it counts as a graphical abstract…):

A lot of people were submitting their thesis titles and I expect to see some of these used as frontispieces in PhDs in the near future. Here’s mine (from 1993) – “Ecology of flowering and fruiting in Lotus corniculatus“:

The other category that I had fun with was using scientific names – here’s the genus Ceropegia:

And here is Apocynaceae:

Can you guess what phrase I used to generate this one?

What’s really fascinating about this system is that every time you generate an image from the same phrase it returns something different. Go have fun, but be warned: it’s a bit of a rabbit hole and it’s possible to waste a lot of time playing around.