Python for Digital Journalism Investigation: Analyzing Article Output Patterns
As avid fans of the sport of cycling, we consume our fair share of cycling journalism. So when we noticed that a sports journalist at a major Danish public broadcaster appeared to churn out several lengthy articles (in many cases more than 50,000 characters) every week, we wondered whether this was a sign of superhuman typing skills or of undisclosed use of AI. Don't get us wrong: AI can be a fantastic resource for journalists and media professionals when used responsibly.
It can speed up research by processing large datasets, save valuable time by transcribing interviews, and help journalists spot trends or connections that might otherwise be missed in complex investigations. However, media organizations carry a particular responsibility for AI-generated content, especially when AI is used beyond the scope of journalistic research and makes its way into the published articles themselves.
In this blog post, we investigate whether it's realistically possible for a journalist to sustain a very high output of published content over several months — or whether the consistency and volume might point to AI involvement. We'll go through how to:
Build a scraper using Python
Clean and process data with Pandas
Analyze publication frequency and character volume
Compare the results to human cognitive limits
Building a Web Scraper with Python
We began by collecting 335 articles published by the journalist between January 2, 2025, and May 5, 2025. While we initially used Instant Data Scraper to identify article URLs, we needed a more robust solution to analyze the content itself. Here's how we built our own scraper using Python:
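For a sense of scale, the figures quoted above already imply a striking publication rate. A quick back-of-the-envelope calculation, using only the article count and date range from the text:

```python
from datetime import date

# Collection window and article count, taken from the text above
start = date(2025, 1, 2)
end = date(2025, 5, 5)
n_articles = 335

span_days = (end - start).days + 1  # inclusive of both endpoints
print(span_days)                          # 124 days
print(round(n_articles / span_days, 2))   # ~2.7 articles per day
```

That works out to roughly 2.7 published articles per day, every day, with no days off, which is what motivated the deeper analysis below.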
Step 1: Setting Up the Environment
First, we needed to install the necessary libraries:
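The post doesn't name the exact packages, but a typical setup for this kind of scraper uses `requests` for HTTP, `beautifulsoup4` for HTML parsing, and `pandas` for the later data analysis:

```shell
pip install requests beautifulsoup4 pandas
```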
Step 2: Creating Helper Functions
We created functions to extract dates from URLs and to handle the scraping process:
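A minimal sketch of what such helpers might look like. Note that the date regex and the `article`/`h1` selectors are assumptions for illustration, since the broadcaster's actual URL scheme and markup aren't shown here:

```python
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup


def extract_date(url: str):
    """Extract a date embedded in an article URL.

    Assumes the URL contains a date like /2025/01/02/ or 2025-01-02;
    the real pattern depends on the site's URL scheme.
    """
    match = re.search(r"(\d{4})[/-](\d{2})[/-](\d{2})", url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    return datetime(year, month, day).date()


def scrape_article(url: str) -> dict:
    """Download one article and return its title, date, and character count."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # 'article' and 'h1' are generic guesses; the real selectors
    # depend on the broadcaster's page structure
    body = soup.find("article") or soup.body
    text = body.get_text(separator=" ", strip=True) if body else ""
    title = soup.h1.get_text(strip=True) if soup.h1 else ""
    return {
        "url": url,
        "date": extract_date(url),
        "title": title,
        "char_count": len(text),
    }
```

Keeping the character count per article alongside its publication date is what lets the later steps aggregate output per day and per week with Pandas.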