Can AI outperform human editors? We put OpenAI to the test!

At Packt Publishing we are always looking for ways to improve our editorial process, especially as we handle highly technical content across a wide range of subjects. When OpenAI released o1-preview and o1-mini, we wanted to test what these reasoning models could do for quality assurance (QA). The models are pitched as strong at tasks like code validation and clarity, where precision and speed matter in technical publishing.

I ran an experiment to compare the feedback from o1-preview and o1-mini against that of our experienced human editors. The question was whether these tools could improve our workflow once we accounted for cost, response speed and functionality.

Why OpenAI o1-preview and o1-mini?

o1-preview is built to solve complex problems through deeper reasoning, which makes it interesting for high-level editorial work like validating code snippets, checking technical accuracy and suggesting clearer phrasing. o1-mini is a cheaper, faster version of the model, and it looked like a good fit for discrete STEM tasks such as maths-related QA.

We wanted to see how the models held up in real editorial scenarios, using content already in development at Packt.

The experiment: testing AI in publishing

We fed several draft chapters through o1-preview and o1-mini and compared the AI feedback against the work of our human editors. We focused on three core areas:

Code validation: could the models spot bugs, inefficiencies or errors in the code?
Clarity of explanations: could they suggest ways to make technical content more readable and accessible?
Fact-checking: could they flag outdated or inaccurate information?

Results: how o1-preview and o1-mini performed

1. Code validation: o1-preview vs. o1-mini

Both models were strong on code validation, in different ways. o1-preview handled the more complex reasoning. In one Java example it caught a logic error that would have made the application behave unexpectedly in certain environments. It flagged the bug and then offered a more efficient fix that improved performance, the kind of thing that can slip past a traditional review.

o1-mini did well on more targeted work like maths validation and basic coding logic checks. Its reasoning capacity is lower than o1-preview, but it was much faster and still gave high-quality, relevant feedback, particularly for STEM content.

2. Clarity of explanations

On the clarity of technical explanations, o1-preview offered useful input but did not clearly beat our human editors. It suggested ways to simplify dense sections, though the gains were incremental rather than dramatic. It occasionally reworded specific technical terms well, but it was not consistent enough to make a high-impact difference.

o1-mini, tuned for STEM reasoning, did well on mathematical explanations and coding logic. It struggled with more general editorial work because it lacks the broad world knowledge that non-STEM content needs.

3. Fact-checking

Fact-checking was a mixed result for o1-preview. It flagged outdated data in a cloud architecture guide and pointed to sections that needed updating, but it struggled with newer, niche technologies, where human expertise was still needed. It was good at verifying general tech specifications and less reliable in emerging fields.

o1-mini, focused on STEM, was weaker at broader fact-checking. It was strong on mathematical validation and code logic, but its limited world knowledge made it less useful for verifying non-technical details or handling general editorial queries.

Challenges: cost, speed, and limitations

The models showed promise, but a few practical issues make them hard to use at large scale right now.

1. Higher cost

o1-preview is far more expensive than the GPT API calls we normally use, so running it across large projects quickly becomes prohibitive. The reasoning is impressive, but the price is hard to justify for routine editorial work when human editors or cheaper models can handle the simpler checks.

o1-mini is 80% cheaper than o1-preview and better suited to discrete tasks like maths reasoning and code validation. That makes it attractive for targeted use, especially when we need quick, accurate feedback on a specific technical task.

2. Slower response speed

A major drawback of o1-preview was its slow response, a side effect of its reasoning-heavy design. The extra processing time occasionally produced more considered feedback, but it slowed our workflow enough to be hard to scale in a busy publishing environment. o1-mini returned feedback much faster, which makes it the better fit for large-scale tasks where quick turnaround matters.

3. Limited functionality

Both models had functionality gaps. o1-preview, still in its preview phase, lacks features like batch processing and structured data support that are essential for scaling editorial work. That made it awkward for larger projects, where processing multiple files or batches at once is what keeps things efficient.

o1-mini is effective at what it is built for, but its STEM focus means it lacks the broader world knowledge needed for fact-checking or more varied editorial tasks. It is less versatile than other models, though its specialisation in coding and maths still makes it valuable in those areas.

A promising supplement, not a full replacement

Our experiment with o1-preview and o1-mini showed real potential, but neither model is ready to replace human editors or cover every QA task. o1-preview is strong on complex code validation and reasoning, held back by its high cost and slow responses. o1-mini is faster and cheaper for discrete tasks like maths validation and coding checks, which makes it a good fit for specialised use.

At Packt we see both models as useful supplements to our human-led QA process. They save time and improve accuracy in targeted areas, especially in code-heavy content. For broader work like wide-ranging fact-checking or content clarity, human expertise is still irreplaceable. As OpenAI keeps developing these models, we will keep testing what they can do.