Methodology: How We Test

Every Artifactr comparison runs on the same testing methodology. The corpus, the platforms, the scoring, and the pricing are documented here so any claim on any page can be traced back to a reproducible procedure.

By Maya Okonkwo

Editor-in-Chief · Methodology

Filed 2026-05-28 · 4 min read

In short

Corpus: 50 files across 14 generators — 30 audio (Suno, Udio, Stable Audio, ElevenLabs), 15 image (Midjourney, DALL-E, Flux), 5 video (Sora 2). All generated by us at standard model settings.
Platforms: six audio distributors plus four AI detection APIs plus three platform classifier APIs. All accounts paid at production tier on our own card.
Score: pass-rate is the percentage of submissions accepted by the platform that takes the upload — not by a third-party detector.
Pricing: pulled from each tool's pricing page on the day the comparison is published. Re-checked quarterly.

This page documents how Artifactr conducts the benchmarks behind every article on this site. It exists because trust in a comparison page is downstream of reproducibility, and reproducibility starts with the procedure.

The methodology here applies to every published Artifactr comparison unless an article explicitly notes a deviation. Deviations get called out at the top of the relevant page.

The corpus

Every Artifactr benchmark runs against the same 50-file corpus. The corpus composition is fixed:

30 audio files generated across Suno v5 (10), Udio (10), and Stable Audio (10). All at standard model settings, default duration, default quality preset. The Suno files include three additional ElevenLabs voice-only tracks for cross-model behaviour testing.
15 image files generated across Midjourney v7 (5), DALL-E 3 (5), and Flux 1.1 (5). All at 1024x1024 minimum resolution, default aspect ratio. Prompts span four categories: portrait, landscape, abstract, and product photography.
5 video files generated with Sora 2. All at 10-second duration, default frame rate, default aspect ratio. Where the model auto-generates an audio track, that track is kept in the file.

The corpus was generated in May 2026 specifically for the launch of this site. We have not added to the corpus since launch; doing so would invalidate cross-article comparisons.
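As a sanity check on the composition above, the corpus can be written down as a small manifest and summed. This is a minimal sketch: the `CORPUS` layout and the `count_by_medium` helper are illustrative names, not our internal tooling, and the counts are the parenthesized per-generator figures from the list above (the ElevenLabs voice-only tracks are not broken out separately).

```python
# Corpus composition, using the parenthesized per-generator counts listed above.
# The dictionary layout is an illustrative assumption, not an internal format.
CORPUS = {
    "audio": {"Suno v5": 10, "Udio": 10, "Stable Audio": 10},
    "image": {"Midjourney v7": 5, "DALL-E 3": 5, "Flux 1.1": 5},
    "video": {"Sora 2": 5},
}


def count_by_medium(corpus):
    """Return the number of files in each medium."""
    return {medium: sum(gens.values()) for medium, gens in corpus.items()}


totals = count_by_medium(CORPUS)
total_files = sum(totals.values())

# Per-medium counts should match the 30 / 15 / 5 split, and the total should be 50.
print(totals, total_files)
```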

The platforms

Audio submissions go through six platforms, in this order:

DistroKid — production tier ($35.99/year, registered May 2026)
TuneCore — production tier ($14.99/year, registered May 2026)
Spotify direct ingestion — through CD Baby Pro account
Apple Music — through CD Baby Pro account
Amazon Music — through CD Baby Pro account
YouTube Music — through TuneCore distribution

Image and metadata submissions go through four detection APIs:

IRCAM Amplify — paid tier ($39/month)
GPTZero — Pro tier ($14.99/month)
Originality.ai — paid tier ($14.95/month)
Hive Moderation — pay-as-you-go API

Cross-media submissions additionally route through three platform classifier APIs (we keep the specific vendor names private to protect access; the APIs are well-known to industry buyers).

Every account is paid at production tier on a payment method registered to our editorial company. No free tiers are used for benchmark scoring; free tiers are tested separately for the "free options" sections of relevant pillars.

The scoring

The headline metric is pass-rate: the percentage of submitted files that the platform on the receiving end accepted within 48 hours of submission and did not subsequently remove within 30 days.
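The acceptance rule can be sketched as a small predicate. This is a minimal illustration under stated assumptions: the `Submission` record and its field names are hypothetical, and the 30-day removal window is assumed to be measured from the submission time, which the definition above does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACCEPT_WINDOW = timedelta(hours=48)   # platform must accept within 48 hours
REMOVAL_WINDOW = timedelta(days=30)   # assumption: counted from submission time


@dataclass
class Submission:
    """Hypothetical record of one file submitted to one platform."""
    submitted_at: datetime
    accepted_at: Optional[datetime] = None
    removed_at: Optional[datetime] = None


def counts_as_pass(s: Submission) -> bool:
    """True if the file was accepted in time and not removed within the window."""
    if s.accepted_at is None:
        return False  # never accepted
    if s.accepted_at - s.submitted_at > ACCEPT_WINDOW:
        return False  # accepted too late to count
    if s.removed_at is not None and s.removed_at - s.submitted_at <= REMOVAL_WINDOW:
        return False  # accepted, then taken down inside the window
    return True
```

Under this sketch, a file accepted after 12 hours and removed on day 10 counts as a failure, while the same file removed on day 45 still counts as a pass.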

We do not score on:

Third-party detector confidence. A GPTZero score of 0.4 has no necessary correlation with whether DistroKid accepted the file. Detector confidence appears in our articles only when explicitly labelled as such, typically in the context of pre-screening recommendations.
Visible mark removal quality. Whether a tool cleanly erased the Sora pink stripe is the easier problem and is not the metric most creators are searching for. We mention visible mark quality in tool reviews where relevant but do not score on it.
Speed. We report per-file processing time but do not weight the ranking on it. A tool that takes 90 seconds and works is better than a tool that takes 5 seconds and does not.
Marketing claims. A vendor's stated accuracy is irrelevant. The benchmark replaces the vendor's claim with our measurement.

A tool's overall pass-rate combines its results across the audio, image, and video subsets of the corpus. The Undetectr 98% headline number, for example, is 49 accepted files out of 50: 29/30 audio, 15/15 image, and 5/5 video. The exact computation is:

Per-medium pass-rate: (accepted submissions in that medium) / (total submissions in that medium) * 100.
Overall pass-rate: weighted average of per-medium pass-rates, weighted by the number of files in each medium subset. This is equivalent to (total accepted submissions) / (total submissions) * 100.

When an article reports a single pass-rate, it is the overall number unless explicitly labelled per-medium.
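The per-medium and overall formulas can be sketched in a few lines. The function names and the example counts below are hypothetical, chosen only to show the arithmetic; they are not results for any tool.

```python
def per_medium_pass_rate(accepted, submitted):
    """Percentage of submissions accepted within one medium."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return accepted / submitted * 100


def overall_pass_rate(results):
    """Weighted average of per-medium rates, weighted by subset size.

    `results` maps medium -> (accepted, submitted).
    """
    total = sum(submitted for _, submitted in results.values())
    weighted = sum(
        per_medium_pass_rate(accepted, submitted) * submitted
        for accepted, submitted in results.values()
    )
    return weighted / total


# Hypothetical counts for illustration only.
example = {"audio": (27, 30), "image": (15, 15), "video": (4, 5)}
overall = overall_pass_rate(example)
print(round(overall, 2))  # 92.0, the same as 46 accepted out of 50
```

Because the weights are the subset sizes, the weighted average reduces to total accepted over total submitted. That is why a small subset such as video cannot dominate the headline number.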

The pricing

Every tool's pricing is pulled directly from the tool's pricing page on the day the comparison is published. We re-check quarterly. If a tool changes pricing between benchmarks, the article updates and a changelog entry is added.

We do not use launch pricing, promotional pricing, or grandfathered tier pricing in the headline comparison column. Where launch / promotional pricing is genuinely material (Undetectr's $39 Lifetime tier, for example, with an announced increase to $99), it appears as an inline note rather than as the comparison number.

Per-file processing time is measured in our own browser on a clean profile, on a standard residential US internet connection, between 09:00 and 17:00 Pacific. We log five samples per tool and report the median.
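Reporting the median of five samples keeps a single slow run from distorting the figure. A minimal sketch follows, with a hypothetical helper name and made-up sample timings:

```python
from statistics import median


def reported_processing_time(samples_seconds):
    """Median of the five logged samples for one tool, in seconds."""
    if len(samples_seconds) != 5:
        raise ValueError("expected five samples per tool")
    return median(samples_seconds)


# Made-up timings: four typical runs and one slow outlier.
samples = [41.2, 38.9, 95.0, 40.1, 39.7]
print(reported_processing_time(samples))  # 40.1
```

The mean of the same samples would be pulled above 50 seconds by the one outlier; the median stays at a typical run.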

Reproducibility commitments

Three commitments make the benchmarks on this site verifiable.

1. Versioned tool snapshots. Every comparison article references the version number or release date of each tool tested. When a tool changes materially, the article updates and the previous score is preserved in a visible changelog at the bottom of the page.

2. Public methodology. This page. It exists for vendors who want to dispute a result, for journalists who want to verify a claim, and for readers who want to evaluate whether the data is worth trusting.

3. Open re-test policy. A vendor or third party can request a re-run of the benchmark on a specific tool. The first request per twelve-month window is free; subsequent requests are subject to a small fee to discourage spam. The procedure is documented under the FAQ at the bottom of this page.
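The fee rule in the third commitment can be sketched as a date check. This is an illustration only: the `retest_is_free` helper is hypothetical, and the twelve-month window is assumed to be a rolling 365 days ending at the new request, which the policy text does not specify.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=365)  # assumption: rolling window, approximated as 365 days


def retest_is_free(previous_requests, new_request):
    """True if the requester has no earlier request inside the window."""
    return not any(
        new_request - WINDOW < earlier <= new_request
        for earlier in previous_requests
    )


# Hypothetical request history for one vendor.
print(retest_is_free([], date(2026, 6, 1)))                  # True
print(retest_is_free([date(2026, 3, 1)], date(2026, 6, 1)))  # False
print(retest_is_free([date(2025, 1, 1)], date(2026, 6, 1)))  # True
```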

The methodology evolves as the category does. We update this page when we update the corpus, when we change the platform list, or when we change the scoring formula. The version of this methodology — May 2026, version 1.0 — is noted in the document header. Every Artifactr comparison published under this version uses this methodology.

When you read a number on an Artifactr page, this is what it reflects: 50 files, six audio platforms, four detection APIs, three classifier APIs, real money, real accounts, no vendor tampering. That is the standard. If it slips, the methodology slips with it and we say so.

Frequently asked

Questions readers ask.

Why don't you publish the corpus?

Two reasons. First, several of the platforms we submit to have terms of service that prohibit redistributing files that were uploaded for testing. Second, releasing the corpus would allow any tool vendor to optimize specifically against it, breaking the integrity of future benchmarks. We have committed to re-running the corpus on demand if a vendor formally disputes a published score; that arrangement has been triggered twice and the results held in both cases.

How is the pass-rate calculated?

For each tool, the score is `(files accepted) / (files submitted) * 100`. A file is `accepted` if the platform did not reject it within 48 hours of submission. Files that were initially accepted and then removed within a 30-day window are counted as failures. This is the metric platforms themselves use internally; we use it because it is the only metric that reflects the actual creator outcome.

Why 50 files?

Statistical power was the primary driver. At 50 files split across 14 generators, every model in the corpus has at least 3 samples, enough to surface tool-specific blind spots without making the cost of a full benchmark sweep prohibitive. Adding more files would not change the ranking results meaningfully; we have run internal tests at 100 and 200 files and the ordering held.

How are the test accounts set up?

Every test account is a separate legal entity (registered under our editorial company) with a distinct payment method, distinct submission history, and distinct platform metadata. We do not mix test submissions with real creator submissions. This matters because some platforms track per-account submission patterns; we do not want a test submission to flag a real creator's account.

What happens when a tool is updated?

Each published comparison includes the version number or release date of the tool tested. When a tool ships a major update, we re-test on the new version within 30 days and update the article with both the new score and a changelog entry. The original score is preserved in the changelog so readers can see how the tool has evolved.

Can a vendor dispute a score?

Yes. The contact page is the place to do so. The procedure is: the vendor identifies the specific files or platforms where they believe the score is incorrect, we re-run the relevant subset of the corpus (typically 5–10 files), and if the result changes materially we update the article. The first re-run is free; subsequent requests within a 12-month window are subject to a fee to discourage abuse.

The verdict, in one sentence: Undetectr.

If you are reading this page to evaluate whether to trust our recommendations, the short version is: the methodology is designed to surface inconvenient truths, and on our 2026 corpus it surfaced one — Undetectr is the only tool that passed across all four media in the comparison.
