Author-produced PDF from LaTeX on the arXiv
arXiv.org has a policy that articles written in TeX/LaTeX should be uploaded as source (tex + bibliography + figures), rather than as a standalone whole-article PDF file. The enforce this policy automatically, by detecting whether the PDF file you upload has been generated from TeX, and blocking your submission if that's the case.
They have their reasons, explained in the policy linked. However, as any blanket policy enforced automatically by a computer program, it is bound to make mistakes sometimes. One case that particularly annoyed me: it rejects all PDF files including TeX-made figures, even if the PDF of the figure was then included in a MS Word manuscript and the whole thing converted to PDF. That was particularly annoying, because for a long period nobody at arXiv replied to my requests, and my files were just being rejected.
There are other reasons why I don't believe this strict policy is a good thing, even when it is technically accurate:
- I take great care of the manuscripts I submit, including non-standard fonts and sometimes typography / figures placement, sometimes with manual editing of the PDF before sending it to the publisher. I would rather people see those than the default LaTeX-styled version of my preprint. (yes, I'm a bit of a perfectionist when it comes to typography; I won't apologize)
- If the inclusion of proper metadata is the issue, there are many PDF manipulations solutions that can do that in an automated manner.
- Why should TeX users be treated more harshly than others? arXiv hosts some very badly formatted Word-produced (or LibreOffice-produced) PDF files.
In any case, here's how to fool the arXiv TeX detector:
1. It is looking for TeX-specific keys in information dictionaries in the PDF. Those look like this:
/PTEX.FileName (./figures/TE.pdf) /PTEX.PageNumber 1 /PTEX.InfoDict 279 0 R /PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014) kpathsea version 6.2.0)
The first three indicate the inclusion of a PDF figure, including its original file name (I consider this bad, because it could actually leak information about the document's author, such as home directory). The last one is only included once, indicating what version of TeX produced the document.
2. Those keys cannot be turned off from the TeX source, they're hardcoded in the pdftex program.
3. But you can replace all of these lines with blank characters, without invalidating the PDF. You cannot remove those characters, because that would mess up the look-up tables (called the Xref tables). But replacing each character with a space will result in a document that is still perfectly valid according to the PDF specification.
sed command-line utility to do so is simple:
sed -e '/PTEX\./s/./ /g' < submitted.pdf > arXiv.pdf
will produce a file named
arXiv.pdf from your original PDF file
Took me half an hour to figure that out in detail, and half an hour to write. Maybe it can save some other poor academics this same amount of time! Let me know in the comments if you ever had trouble of the sort…