The siren song of artificial intelligence grows louder, promising efficiency and transformation across industries. A particularly enticing prospect is running powerful AI models directly on personal computers, bypassing cloud dependence, subscription fees, and data privacy concerns. Giants like Google, Meta, and Mistral AI have made sophisticated Large Language Models (LLMs) freely available for download. But does this accessibility translate into practical utility? Can these digital minds, confined to the silicon of a desktop or laptop, truly augment complex workflows like journalistic writing? This account details an extensive experiment designed to answer precisely that question.
Setting the Stage: The Local AI Experiment
Over several months, a dedicated effort was undertaken to evaluate the real-world performance of various freely downloadable LLMs operating entirely on local hardware. The roster of models under scrutiny was diverse, reflecting the rapidly evolving landscape of open-source AI:
- Google Gemma (specifically version 3)
- Meta Llama (version 3.3)
- Anthropic Claude (version 3.7 Sonnet – not downloadable for local use; its inclusion suggests it served as a cloud-based point of comparison)
- Multiple iterations from Mistral AI (including Mistral, Mistral Small 3.1, Mistral Nemo, and Mixtral)
- IBM Granite (version 3.2)
- Alibaba Qwen (version 2.5)
- DeepSeek R1 (tested in its distilled variants, which transfer R1’s reasoning training onto Qwen or Llama base models)
The core objective was ambitious yet practical: to determine if these locally run AIs could transform raw interview transcripts into polished, publishable articles. This involved assessing not just the technical feasibility – could the hardware handle the load? – but also the qualitative output – was the resulting text usable? It’s crucial to state upfront that achieving a fully automated, publish-ready article proved elusive. The primary goal shifted towards understanding the genuine capabilities and limitations of current on-device AI through this specific, demanding use case.
The chosen methodology centered on a substantial prompt. This included approximately 1,500 tokens (roughly 6,000 characters or two full pages of text) meticulously outlining the desired article structure, style, and tone. Added to this instruction set was the interview transcript itself, averaging around 11,000 tokens for a typical 45-minute conversation. The sheer size of this combined input (often exceeding 12,500 tokens) typically surpasses the free usage limits of many online AI platforms. This constraint underscored the rationale for exploring local deployment, where processing remains free regardless of input size, limited only by the machine’s capabilities.
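For readers who want to gauge whether their own material would fit, a rough estimate takes only a few lines of Python. The sketch below relies on the same approximate ratio implied above (about four characters per token, since 1,500 tokens corresponded to roughly 6,000 characters); real counts depend on each model’s tokenizer, and the file names are hypothetical.

```python
# Quick estimate of whether instructions + transcript fit a given context window, using a
# ~4 characters-per-token heuristic. Real counts depend on each model's tokenizer; the
# file names here are hypothetical.
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English or French prose

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(prompt: str, transcript: str,
                    context_window: int = 32_000, reply_budget: int = 2_000) -> bool:
    """Leave room in the window for the generated article itself."""
    return estimate_tokens(prompt) + estimate_tokens(transcript) + reply_budget <= context_window

prompt = Path("article_instructions.txt").read_text(encoding="utf-8")
transcript = Path("interview_transcript.txt").read_text(encoding="utf-8")
print("Estimated input tokens:", estimate_tokens(prompt) + estimate_tokens(transcript))
print("Fits in a 32k window  :", fits_in_context(prompt, transcript))
```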
Executing these tests involved LM Studio, a popular desktop application that provides a user-friendly, chatbot-like interface for interacting with LLMs running locally. LM Studio conveniently integrates model downloading, although the primary source for these freely available models remains the Hugging Face repository, a central hub for the AI community.
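For those who prefer scripting over the chat window, LM Studio can also expose the loaded model through a local, OpenAI-compatible server. The minimal sketch below assumes that server is enabled on its default port; the model identifier and file names are placeholders, not the exact configuration used in this experiment.

```python
# Minimal sketch of driving the locally loaded model from a script rather than the chat
# window, assuming LM Studio's local server is enabled on its default port and exposes the
# OpenAI-compatible /v1/chat/completions endpoint. Model identifier and file names are
# placeholders.
from pathlib import Path
import requests

payload = {
    "model": "gemma-3-27b-it",   # placeholder: use the name LM Studio reports for your model
    "messages": [
        {"role": "system", "content": Path("article_instructions.txt").read_text(encoding="utf-8")},
        {"role": "user", "content": Path("interview_transcript.txt").read_text(encoding="utf-8")},
    ],
    "temperature": 0.7,
    "max_tokens": 2_000,         # budget reserved for the generated article
}

response = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=3600)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```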
Navigating the Technical Labyrinth: Hardware, Memory, and Model Size
The journey into local AI processing quickly revealed a complex interplay between software and hardware. The quality and speed of the AI’s output were intimately tied to the resources available on the test machine – a Mac equipped with an Apple Silicon M1 Max system-on-chip (SoC) and a generous 64 GB of RAM. Critically, this SoC features a Unified Memory Architecture (UMA), allowing 48 GB of RAM to be dynamically shared among the processor cores (CPU), graphics cores (GPU – used for vector acceleration), and neural processing unit cores (NPU – used for matrix acceleration).
Several key technical factors emerged as decisive:
- Model Parameters: LLMs are often measured by their number of parameters (billions, typically). Larger models generally possess greater knowledge and nuance. However, they demand significantly more memory.
- Quantization: This refers to the precision used to store the model’s parameters (e.g., 8-bit, 4-bit, 3-bit). Lower bit precision drastically reduces memory footprint and increases processing speed, but often at the cost of accuracy and output quality (introducing errors, repetition, or nonsensical language).
- Context Window: This defines the maximum amount of information (prompt + input data) the AI can consider at once, measured in tokens. The required window size is dictated by the task; in this case, the large prompt and transcript necessitated a substantial window.
- Available RAM: The amount of memory directly limits which models (and at which quantization level) can be loaded and run effectively.
The sweet spot, providing the best balance of quality and feasibility on the test machine at the time of evaluation, was achieved using Google’s Gemma model with 27 billion parameters, quantized to 8 bits (version ‘27B Q8_0’). This configuration operated within a 32,000-token context window, comfortably handling the approximately 15,000-token input (instructions + transcript). It ran on the specified Mac hardware, utilizing the 48 GB of shared memory.
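A back-of-envelope calculation shows why this configuration sat near the machine’s limit. The sketch below uses the common rule of thumb that an n-billion-parameter model quantized to b bits needs roughly n × b / 8 gigabytes for its weights alone; the KV cache for a 32,000-token window and the runtime’s own buffers add several more gigabytes on top, which these figures deliberately leave out and which is why anything much beyond 27 billion parameters pushed past the 48 GB of shared memory.

```python
# Rough footprint of the quantized weights alone: parameters x bits / 8, in GB.
# The KV cache for a 32,000-token window and runtime buffers add several more GB on top,
# so the real requirement is noticeably higher than these estimates.

def weight_footprint_gb(params_billion: float, bits: int) -> float:
    """Approximate size of the quantized weights, in GB."""
    return params_billion * bits / 8

for name, params, bits in [("Gemma 27B, 8-bit", 27, 8),
                           ("Mistral Small 3.1 24B, 8-bit", 24, 8),
                           ("32B, 8-bit", 32, 8),
                           ("70B, 8-bit", 70, 8)]:
    print(f"{name}: ~{weight_footprint_gb(params, bits):.0f} GB of weights")
```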
Under these optimal conditions, the processing speed was measured at 6.82 tokens per second. While functional, this is far from instantaneous. Speed improvements without sacrificing output quality primarily depend on faster hardware – specifically, SoCs with higher clock speeds (GHz) or a greater number of processing cores (CPU, GPU, NPU).
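A quick calculation puts that speed in concrete terms, assuming, purely for illustration, a finished article of around 1,200 tokens and ignoring the time the model spends reading the input first.

```python
# Implied generation time for one article at the measured speed. The article length is an
# assumption; prompt ingestion time would add to the total.
GENERATION_SPEED_TPS = 6.82   # tokens per second, as measured above
ARTICLE_TOKENS = 1_200        # assumed length of a finished article

minutes = ARTICLE_TOKENS / GENERATION_SPEED_TPS / 60
print(f"~{minutes:.1f} minutes to generate the article")   # roughly three minutes
```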
Attempting to load models with significantly more parameters (e.g., 32 billion, 70 billion) quickly hit the memory ceiling. These larger models either failed to load entirely or produced severely truncated, unusable output (like a single paragraph instead of a full article). Conversely, using models with fewer parameters, while freeing up memory, resulted in a noticeable decline in writing quality, characterized by repetition and poorly articulated ideas. Similarly, employing more aggressive quantization (reducing parameters to 3, 4, 5, or 6 bits) boosted speed but severely degraded the output, introducing grammatical mistakes and even fabricated words.
The size of the required context window, determined by the input data, is essentially non-negotiable for the task. If the input data demands a window that, combined with the chosen model size and quantization, exceeds available RAM, the only recourse is to select a smaller model, inevitably compromising the potential quality of the final result to stay within memory limits.
The Quest for Quality: When Structure Meets Substance (or Lack Thereof)
Did the locally run AI succeed in generating usable articles? Yes and no. The generated texts often exhibited surprisingly good structure. They generally adhered to the requested format, featuring:
- A discernible angle or focus.
- A coherent flow through thematic sections.
- Appropriately placed quotations from the transcript.
- Engaging headlines and concluding sentences.
However, a critical flaw emerged consistently across all tested LLMs, including models such as DeepSeek R1 that are specifically designed for enhanced reasoning: a fundamental inability to discern and prioritize the most relevant information in the interview. The models consistently missed the crux of the conversation, focusing instead on secondary points or tangential details.
The result was often articles that were grammatically sound and well-organized but ultimately superficial and uninteresting. In some instances, the AI would dedicate significant, well-argued passages to stating the obvious – for example, elaborating at length that the interviewed company operates in a market with competitors. This highlighted a gap between linguistic competence (forming coherent sentences) and genuine comprehension (understanding importance and context).
Furthermore, the stylistic output varied considerably between models:
- Meta’s Llama 3.x: at the time of testing, it produced sentences that were often convoluted and difficult to parse.
- Mistral Models & Gemma: Showed a tendency towards a ‘marketing speak’ style, employing effusive adjectives and positive framing but lacking concrete substance and specific detail.
- Alibaba’s Qwen: Surprisingly, within the constraints of the test setup, this Chinese model produced some of the most aesthetically pleasing prose in French (the language of the original evaluation team).
- Mixtral 8x7B: Initially, this ‘mixture of experts’ model (combining eight smaller, specialized 7-billion parameter models) showed promise. However, fitting it within the 48 GB memory constraint required aggressive 3-bit quantization, which led to significant syntax errors. A 4-bit quantized version (‘Q4_K_M’) offered a better compromise initially, but subsequent updates to the LM Studio software increased its memory footprint, causing this configuration to also produce truncated results.
- Mistral Small 3.1: A more recent model with 24 billion parameters at 8-bit quantization emerged as a strong contender. Its output quality approached that of the 27B Gemma model, and it offered a slight speed advantage, processing at 8.65 tokens per second.
This variation underscores that choosing an LLM isn’t just about size or speed; the underlying training data and architecture significantly influence its writing style and potential biases.
Hardware Architecture: The Unsung Hero of Local AI
The experiments shed light on a crucial, often overlooked factor: the underlying hardware architecture, specifically how memory is accessed. The superior performance observed on the Apple Silicon Mac wasn’t solely due to the amount of RAM but critically hinged on its Unified Memory Architecture (UMA).
In a UMA system, the CPU, GPU, and NPU cores all share the same pool of physical RAM and can access data at the same memory addresses simultaneously. This eliminates the need to copy data between separate memory pools dedicated to different processors (e.g., system RAM for the CPU and dedicated VRAM for a discrete graphics card).
Why is this so important for LLMs?
- Efficiency: LLM processing involves intense computation across different types of cores. UMA allows seamless data sharing, reducing latency and overhead associated with data duplication and transfer.
- Memory Utilization: In systems without UMA (like a typical PC with a discrete GPU), the same data might need to be loaded into both the main system RAM (for the CPU) and the GPU’s VRAM. This effectively reduces the usable memory for the LLM itself.
The practical implication is significant. While the test Mac could comfortably run a 27-billion parameter, 8-bit quantized model using 48 GB of shared UMA RAM, achieving similar performance on a PC without UMA might require substantially more total RAM. For example, a PC with 48 GB total RAM split into 24 GB for the CPU and 24 GB for the GPU might only be capable of running a much smaller 13-billion parameter model effectively, due to the memory partitioning and data duplication overhead.
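The same rough arithmetic used earlier illustrates the point: in a unified pool the weights only need to fit within the single shared budget, whereas on a partitioned system they must fit entirely inside the GPU’s own slice. The overhead figure below is an assumption for illustration, not a measured value.

```python
# Why partitioning hurts: on a unified system the model only has to fit in the single
# shared pool, whereas on a split system its weights must fit inside the GPU's partition.
# Same rule of thumb as before (weights ~ parameters x bits / 8); overhead is assumed.
WEIGHTS_27B_Q8 = 27 * 8 / 8   # ~27 GB for the Gemma 27B 8-bit build
WEIGHTS_13B_Q8 = 13 * 8 / 8   # ~13 GB for a 13B 8-bit model
OVERHEAD_GB = 8               # assumed KV cache + runtime buffers

print("27B in a 48 GB unified pool :", WEIGHTS_27B_Q8 + OVERHEAD_GB <= 48)  # True
print("27B in a 24 GB GPU partition:", WEIGHTS_27B_Q8 + OVERHEAD_GB <= 24)  # False
print("13B in a 24 GB GPU partition:", WEIGHTS_13B_Q8 + OVERHEAD_GB <= 24)  # True
```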
This architectural advantage explains the early lead Macs with Apple Silicon chips gained in the local AI space. Recognizing this, competitors like AMD announced their Ryzen AI Max SoC range (expected in early 2025) designed to incorporate a similar unified memory approach. As of the time of these tests, Intel’s Core Ultra SoCs, while integrating CPU, GPU, and NPU, did not feature the same level of fully unified memory access across all core types. This hardware distinction is a critical consideration for anyone serious about running larger, more capable LLMs locally.
The Intricate Dance of Prompt Engineering
Getting an AI to perform a complex task like transforming an interview into an article requires more than just powerful hardware and a capable model; it demands sophisticated instruction – the art and science of prompt engineering. Crafting the initial 1,500-token prompt that guided the AI was a significant undertaking.
A useful starting point involved reverse engineering: feeding the AI a completed, human-written article alongside its corresponding transcript and asking what prompt should have been given to achieve that result. Analyzing the AI’s suggestions across several diverse examples helped identify essential elements for the instruction set.
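In practice, that reverse-engineering question can be posed through the same local endpoint described earlier. The sketch below shows one possible wording; the phrasing and file names are illustrative, not the exact text used in the experiment.

```python
# One way to pose the reverse-engineering question through the local endpoint. The wording
# and file names are illustrative, not the exact text used in the experiment.
from pathlib import Path
import requests

article = Path("published_article.txt").read_text(encoding="utf-8")        # human-written reference
transcript = Path("interview_transcript.txt").read_text(encoding="utf-8")

meta_prompt = (
    "Here is a published article and the interview transcript it was written from.\n"
    "Write, as a detailed prompt, the instructions that should be given to a language model "
    "so that it would produce an article like this one from this transcript.\n\n"
    f"ARTICLE:\n{article}\n\nTRANSCRIPT:\n{transcript}"
)

response = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={"model": "gemma-3-27b-it",   # placeholder identifier
          "messages": [{"role": "user", "content": meta_prompt}],
          "max_tokens": 1_500},
    timeout=3600,
)
print(response.json()["choices"][0]["message"]["content"])
```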
However, AI-generated prompt suggestions were consistently too brief and lacked the necessary detail to guide the creation of a comprehensive article. The real work lay in taking these initial AI-provided leads and elaborating on them, embedding deep domain knowledge about journalistic structure, tone, style, and ethical considerations.
Several non-intuitive lessons emerged:
- Clarity over Elegance: Surprisingly, writing the prompt in a more natural, flowing style often decreased the AI’s comprehension. Models struggled with ambiguity, particularly pronouns (‘he,’ ‘it,’ ‘this’). The most effective approach involved sacrificing human readability for machine precision, explicitly repeating subjects (‘the article should…’, ‘the tone of the article must…’, ‘the introduction of the article needs…’) to avoid any potential misinterpretation (see the sketch after this list).
- The Elusive Nature of Creativity: Despite careful prompt design aimed at allowing flexibility, the AI-generated articles consistently shared a ‘family resemblance.’ Capturing the breadth of human creativity and stylistic variation within a single prompt, or even multiple competing prompts, proved exceptionally difficult. True variety seemed to require more fundamental shifts than prompt tweaking alone could provide.
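To make the first of these lessons concrete, here is a short, hypothetical excerpt in the repeated-subject register that proved most reliable; it is only a sketch of the style, not the actual 1,500-token prompt.

```python
# Hypothetical excerpt in the explicit, repeated-subject register described above.
# This is a sketch of the style only; it is not the actual prompt used in the experiment.
STYLE_RULES = """\
The article must open with a single sentence stating the main announcement.
The tone of the article must remain factual; the article must avoid marketing adjectives.
The introduction of the article needs to name the company and the interviewee.
The article should quote the interviewee verbatim at least three times.
The conclusion of the article must not introduce information absent from the transcript.
"""
```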
Prompt engineering is not a one-time task but an iterative process of refinement, testing, and incorporating specific business logic and stylistic nuances. It requires a blend of technical understanding and deep subject matter expertise.
The Workload Shift: Unpacking the AI Paradox
The experiments ultimately led to a critical realization, termed the AI paradox: in its current state, for AI to potentially alleviate some user workload (writing the article draft), the user often has to invest more preliminary work.
The core issue remained the AI’s inability to reliably gauge relevance within the raw interview transcript. To produce a pertinent article, simply feeding the entire transcript was insufficient. A necessary intermediate step emerged: manually pre-processing the transcript. This involved:
- Stripping out irrelevant chatter, digressions, and redundancies.
- Potentially adding contextual notes (even if not meant for the final article) to guide the AI’s understanding.
- Carefully selecting and perhaps reordering key segments.
This transcript ‘curation’ requires significant human time and judgment. The time saved by having the AI generate a first draft was effectively offset, or even outweighed, by the new task of meticulously preparing its input data. The workload didn’t disappear; it merely shifted from direct writing to data preparation and prompt refinement.
Furthermore, the detailed 1,500-token prompt was highly specific to one type of article (e.g., an interview about a product launch). Covering the diverse range of article formats a journalist produces daily – startup profiles, strategic analyses, event coverage, multi-source investigations – would require developing, testing, and maintaining a separate, equally detailed prompt for each use case. This represents a substantial upfront and ongoing engineering investment.
Worse still, these extensive experiments, spanning more than six months, only scratched the surface. They focused on the simplest scenario: generating an article from a single interview, often conducted in controlled settings like press conferences where the interviewee’s points are already somewhat structured. The far more complex, yet commonplace, tasks of synthesizing information from multiple interviews, incorporating background research, or handling less structured conversations remained unexplored due to the time investment required even for the basic case.
Therefore, while running LLMs locally is technically feasible and offers benefits in terms of cost and data privacy, the notion that it readily saves time or effort for complex knowledge work like journalism is, based on this investigation, illusory at present. The required effort simply transforms, moving upstream into data preparation and highly specific prompt engineering. On these specific challenges – discerning relevance, requiring extensive pre-processing – the locally run AI performed comparably to paid online services, suggesting these are fundamental limitations of the current generation of LLMs, regardless of deployment method. The path to truly seamless AI assistance in such domains remains intricate and demands further evolution in both AI capabilities and our methods of interacting with them.