FX's blog: musings on chemistry, among other things…


Friday 11 December 2015

Same-journal citation in Chemistry

(9 months, no blog entry…)

This week, I read two articles on very different aspects of citation patterns. The first was an analysis by Stuart Cantrill, over at the Nature Chemistry blog, of the citation distribution behind the journal's impact factor. The second was a very enlightening 2012 Science paper on Coercive Citation in Academic Publishing, and it got me wondering: I have never, so far, experienced coercive citation from an editor (i.e. an editor instructing me to add more citations to their own journal). However, I have several times in the past been advised by wise and knowledgeable (read: older) colleagues to make sure I included same-journal references before submitting a manuscript. This advice usually comes with some logic behind it, like “it makes the editor less likely to judge the paper out of scope if they see a lot of citations from their own journal”. But although that logic could apply to specialized journals, it makes no sense for general-audience journals…

So, I wanted to look at the citation patterns among the “top three” general audience chemistry journals, namely Nature Chemistry, JACS and Angewandte Chemie. I took a sample of 2000 papers each from the 2010–2014 period (only 1200 for Nature Chem, which publishes fewer papers), and looked at the distribution of references from those papers. Let's first look at the most-frequently cited journals for each source:

Most-frequently cited journals

First, we see that in all three journals, JACS and Angewandte are the most cited journals (in that order). This makes sense: they really are the top general journals in chemistry and publish a lot of papers every year (far more than Nature Chem; this difference in publication volume explains the latter's much lower spot). After that, you can begin to spot differences in the journals cited: Nature Chem features the interdisciplinary journals Science, Nature, and PNAS more heavily. On the other hand, Angewandte (and JACS to a lesser extent) clearly features more citations to subfield-specific journals, in particular organic chemistry journals. This would reflect a heavier focus of Angewandte on organic chemistry and synthesis, something that is regularly mentioned in chemistry circles (mostly by people outside molecular chemistry)! By contrast, the only subfield-specific journal to make it into Nature Chem’s top 10 is a physical chemistry journal, the venerable Journal of Chemical Physics (which is one of my personal favorites).

Next, let's focus on citations between the three journals themselves:

Same-journal citations

For column X (source) and row Y (citation target), the table gives the percentage of references in X that point to Y. For example, 12.1% of the references found in Angewandte are to articles in Angewandte, while 14.8% are to papers in JACS. What is interesting (to me) is same-journal citation, i.e. whether papers in journal X are more likely to be cited in that same journal than in the others. And… it is the case, by about 2% to 3% in each case. So there is a small but significant excess of same-journal citation. I can see three possible explanations:

  1. One explanation may be that these journals have different audiences, and therefore it is natural that they feature more self-citation than other journals. I think this is definitely not true in terms of the subfields of chemistry, since all three are general chemistry journals with broad readership.
  2. There might be a geographical effect, with more US authors in JACS and more German authors in Angewandte… but are you really more likely to cite your (geographical) neighbor rather than the work of chemists from another continent? I do not think this can account for the differences observed.
  3. The final reason is that there may still exist, consciously or unconsciously, a tendency to include (or favor) same-journal references when writing a manuscript for a specific journal.

Let me know in the comments or on Twitter what you think! There surely are other possible reasons I have failed to see…

PS: on the topic of geographical diversity of these journals, you can go back and see my earlier post on the globalization of chemistry as seen through publications in the field…

Wednesday 26 November 2014

Author-produced PDF from LaTeX on the arXiv

arXiv.org has a policy that articles written in TeX/LaTeX should be uploaded as source (tex + bibliography + figures), rather than as a standalone whole-article PDF file. They enforce this policy automatically, by detecting whether the PDF file you upload was generated from TeX, and blocking your submission if that's the case.

They have their reasons, explained in the policy linked above. However, like any blanket policy enforced automatically by a computer program, it is bound to make mistakes sometimes. One case particularly annoyed me: it rejects all PDF files that include TeX-made figures, even when the PDF of the figure was included in an MS Word manuscript and the whole thing converted to PDF. That was especially frustrating because, for a long period, nobody at arXiv replied to my requests, and my files were simply rejected.

There are other reasons why I don't believe this strict policy is a good thing, even when the detection is technically accurate:

  • I take great care with the manuscripts I submit, including non-standard fonts and sometimes careful typography and figure placement, occasionally with manual editing of the PDF before sending it to the publisher. I would rather people see those than the default LaTeX-styled version of my preprint. (Yes, I'm a bit of a perfectionist when it comes to typography; I won't apologize.)
  • If the inclusion of proper metadata is the issue, there are many PDF manipulation tools that can do that in an automated manner.
  • Why should TeX users be treated more harshly than others? arXiv hosts some very badly formatted Word-produced (or LibreOffice-produced) PDF files.

In any case, here's how to fool the arXiv TeX detector:

1. The detector looks for TeX-specific keys in the PDF's information dictionaries. Those look like this:

/PTEX.FileName (./figures/TE.pdf)
/PTEX.PageNumber 1
/PTEX.InfoDict 279 0 R
/PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014) kpathsea version 6.2.0)

The first three indicate the inclusion of a PDF figure, including its original file name (I consider this bad, because it can leak information about the document's author, such as their home directory). The last one is included only once, indicating which version of TeX produced the document.

2. Those keys cannot be turned off from the TeX source, they're hardcoded in the pdftex program.

3. But you can replace all of these lines with blank characters, without invalidating the PDF. You cannot simply remove those characters, because that would shift byte offsets and corrupt the cross-reference (xref) tables. But replacing each character with a space results in a document that is still perfectly valid according to the PDF specification.

Using the sed command-line utility to do so is simple:

sed -e '/PTEX\./s/./ /g' < submitted.pdf > arXiv.pdf

will produce a file named arXiv.pdf from your original submitted.pdf.
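For those who prefer a script (or don't have sed at hand), here is a minimal Python sketch of the same trick; the function name and file names below are my own, not part of any arXiv tooling. It blanks every line containing a /PTEX. key with spaces, so the file size and byte offsets are untouched:

```python
# Blank out every line of a PDF that contains a "/PTEX." key,
# replacing each character with a space so that byte offsets
# (and thus the xref table) are preserved.
def blank_ptex_lines(data: bytes) -> bytes:
    lines = data.split(b"\n")
    cleaned = [b" " * len(line) if b"/PTEX." in line else line
               for line in lines]
    return b"\n".join(cleaned)

# Example usage (hypothetical file names):
# with open("submitted.pdf", "rb") as f:
#     pdf = f.read()
# with open("arXiv.pdf", "wb") as f:
#     f.write(blank_ptex_lines(pdf))
```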

It took me half an hour to figure that out in detail, and half an hour to write it up. Maybe it can save some other poor academics the same amount of time! Let me know in the comments if you have ever had trouble of this sort…

Saturday 24 May 2014

Publishing chemical structures in the 21st century

This entry is somewhere between a rant and a call for comments, but it is on a topic that is close to my heart, and which I think may interest a few others: reproducibility of science and publication practices. During the past two weeks, I have been annoyed several times while trying to reproduce (or build upon) published computational work, starting from the structures the authors had worked on/reported/predicted. And there, I was sorely disappointed: in many cases, the structures are not readily available!

Out of the seven papers I've had to work with recently (all published in 2009 or later), here are the various behaviors I have observed:

  1. structures described in short format in the paper itself
  2. full listing of atomic positions in supporting information, in PDF format
  3. a screenshot (bitmap image) of a full listing of atomic positions, included in the PDF supporting information
  4. in one case, the structures were not included at all: only their unit cell parameters were given (and a reference for the experimental crystallographic structure from which calculations were started)

What bothers me is that, in all cases, it takes a non-trivial amount of time to produce a structure file, either by copy-pasting information or retyping it, while it would have cost the authors nothing to publish the structures in a standard text-based data-minable format: CIF file for crystals, XYZ for molecules, CML if you like it, etc. This can be achieved either by publishing it as supporting information, or by depositing it in a database. As a referee, I would definitely have flagged that in my review, in the name of reproducibility and good scientific practices.
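To illustrate how little effort this would take: the XYZ format is just an atom count, a comment line, and one “symbol x y z” line per atom. Here is a minimal sketch (the water geometry below is illustrative, not taken from any of the papers discussed):

```python
# Write a list of (symbol, x, y, z) atomic positions as XYZ text:
# atom count, comment line, then one line per atom.
def to_xyz(atoms, comment="illustrative geometry"):
    lines = [str(len(atoms)), comment]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol:2s} {x:12.6f} {y:12.6f} {z:12.6f}")
    return "\n".join(lines) + "\n"

# A water molecule, with illustrative coordinates (in Å):
water = [
    ("O", 0.000, 0.000, 0.117),
    ("H", 0.000, 0.757, -0.469),
    ("H", 0.000, -0.757, -0.469),
]
```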

So this was the rant part. Now, the call for comments: given that the practice outlined above endures, I wonder: what are the arguments against this? And in particular, how is the standard in computational/theoretical chemistry so different from, e.g., experimental crystallography (where deposition into databases is the norm)?

As author, as referee and/or as editor, what is your point of view on this? Is mine a minority view, or are things the way they are simply because of the system's inertia?

Thursday 24 April 2014

Creating a Twitter bot to survey the literature for me (& others?)

I'm starting an experiment, like some Twitter colleagues around me have done in their respective fields: I'm creating a twitterbot to survey the MOF (metal–organic frameworks) literature for me. It's an attempt to bring to Twitter (which I use more and more) my earlier workflow for keeping an eye on the literature. I used to subscribe to RSS feeds from certain key journals, and browse through them when I had the time. The upside is that, now and then, you read stuff outside your own research subfield. The downside… is that it's very time-consuming. So, I could only follow a few journals…

Exponential growth of MOF papers


How does it work?

The original inspiration dates back a few months: I became aware of this possibility through Sylvain Deville's announcement of his IT_papers bot. But I didn't exactly follow the "established" methodology… All the blog posts I could find about setting up a twitterbot for scientific literature rely on keyword-based queries of databases (PubMed, Google Scholar, etc.). I didn't want to follow this approach, so instead my bot relies on filtering RSS feeds through Yahoo Pipes. The Pipes workflow is very simple:

Yahoo Pipes workflow

Simply copy-paste a large number of journals' RSS feeds (I've got all the relevant RSC, ACS and Wiley journals covered… I'm probably missing some from Elsevier, but they're not as active in my field), join them all, filter titles and abstracts for specific keywords, and… voilà! I then used dlvr.it to post the resulting RSS feed to Twitter.
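Since Yahoo Pipes is a hosted graphical tool, here is, for the curious, a stdlib-only Python sketch of the same filtering logic; the keywords, function name, and sample feed structure are illustrative, not what the bot actually runs:

```python
import xml.etree.ElementTree as ET

# Illustrative keyword list; the real bot's list is longer.
KEYWORDS = ("metal-organic framework", "metal–organic framework", "MOF")

def filter_rss_items(rss_xml, keywords=KEYWORDS):
    """Return titles of RSS items whose title or description
    mentions one of the keywords (case-insensitive)."""
    root = ET.fromstring(rss_xml)
    kept = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        desc = item.findtext("description", default="")
        haystack = (title + " " + desc).lower()
        if any(k.lower() in haystack for k in keywords):
            kept.append(title)
    return kept
```

The real workflow would fetch and merge several feeds first; the filtering step above is the core of it.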

The future

Given the large number of MOF papers published every day (figure 1 above), I don't know how manageable the resulting feed will be, and whether I'll end up using it as my tool for staying up-to-date on the torrent of MOF literature…

More importantly, I don't know if others will find it useful. So, I welcome all feedback on this initiative, whether through comments below this entry, Twitter messages, etc.

Wednesday 26 February 2014

Globalization of chemistry over 5 decades (part 2)

This article is the second part of a series on the evolution of chemistry papers between 1961 and 2011, in which I play with data from JACS papers. Part 1 is here.

In the previous post, I looked at the inflation of authors and references in chemistry writing that has taken place since the 60's. As Matteo Cavalleri put it: “More of everything”. Reflecting back, one of the surprises (to me) was that the phenomenon is not recent, but has been progressive since the 60's. As a researcher who's been in academia for 10 years, I thought it had begun one or two decades ago… but this is a long-lived trend.

Going forward, today we will explore the “world” of chemistry authors and publishers: who writes papers in JACS? What is their diversity (in terms of affiliations, countries, etc.)? How much did globalization affect chemistry research & writing?

Author diversity

So, the number of authors on a given paper increases over time, as does the number of affiliations… and since the absolute number of papers increased greatly in the meantime (1364 papers in 1961 vs. 3176 in 2011), there are necessarily more authors in JACS today than 5 decades ago. But those authors are not all different… and I wondered whether JACS is, in part, a cozy “club” whose members publish multiple papers a year, or whether publication in JACS is a rare event in a chemist's typical year (I know the answer for computational chemistry, sure, but I don't know much about the publishing habits of, e.g., organic chemists). So, here's the number of papers in JACS, per author, for a given year:

Multiple authorship in JACS

As expected, the majority of authors published only one JACS paper in a given year (which, of course, doesn't mean that people publish on average one JACS paper per year…). However, the proportion of “multiple papers” authors is far from negligible: 26% in 1961, and 14% in 2011. I actually quite like that this number is going down over time, because I interpret it as a sign that JACS authorship is becoming more diverse.
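For the record, tallying this kind of statistic is a one-liner with a Counter; the records below are made-up examples, not the actual JACS data:

```python
from collections import Counter

# Hypothetical (author, year) records, one per paper authorship.
records = [
    ("H. C. Brown", 1961), ("H. C. Brown", 1961),
    ("S. Winstein", 1961), ("S. Fukuzumi", 2011),
]

def papers_per_author(records, year):
    """Number of papers per author in a given year."""
    return Counter(a for a, y in records if y == year)

def multi_paper_fraction(records, year):
    """Fraction of that year's authors with more than one paper."""
    counts = papers_per_author(records, year)
    return sum(1 for c in counts.values() if c > 1) / len(counts)
```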

I also looked at the 1% (of authors with the most papers per year): there is relatively little change there. To be in the 1% in 1961, you needed 6 papers in a year, while you needed only 5 in 2011. For the anecdotal value, the most prolific author in 1961 was Herbert C. Brown (at Purdue University), with 16 JACS papers (out of the 538 of his entire career), including some of a 23-part series of articles (here's “Hydroboration. XXIII”, for example)! In 2011, the title goes to Shunichi Fukuzumi, with 15 papers.


In 1961, the Journal of the American Chemical Society was almost exactly that. I don't have systematic affiliation data, so I cannot analyze the geographic distribution, but the top 9 authors were North American (all working in the US, except the Canadian chemist Saul Winstein). Author #10 was British, and a few Europeans start to appear in the list after that point. Among the “top 5%” authors, all were North American or European.

Fast-forward to 2011: the top 3 authors were Japanese. The top of the list is dominated by US and Japanese chemists, with a sparse European presence. The first Korean author appears at #21 (Wonwoo Nam), tied with the first Chinese author (Lei Liu; 7 papers each). I didn't find any Indian colleague in the “top 1%”.

To look at this in a more quantitative manner, let's look at the distribution of countries (this is counted per affiliation, not per author):

Affiliation distribution per country, JACS

(In the legend, only countries with ≥3 papers are featured.) Obviously, over the course of 30 years, globalization has had a large impact on authorship: the US share of authors has gone down from two-thirds to a small half, while other countries have progressed. China went from 0.1% of papers to 7%. Europe, as a whole, has grown from 14% to 25%; diversity within European countries has also increased.

Still, US dominance is quite prevalent, and China's slice of the whole is rather small… It seems that, even today, there remains an outsized share of American authors in the Journal of the American Chemical Society. This is not, after all, a bad thing in itself (other countries have their own journals), but it can reinforce bias when bibliometric factors are used in the evaluation of researchers.

Gender balance

Oh, just kidding. I'd love to study this further, but given that no systematic data is available, there's not much I can do. I scrolled the list of 1961 and 1981 top authors looking for female scientists, and I got bored before I found any. In 2011, there were 3 female chemists among the first 36 authors (8%): Naomi Mizorogi (University of Tsukuba), Wei Wang (UCLA), and Melanie Sanford (University of Michigan).

Friday 21 February 2014

Evolution of chemistry writing over 5 decades (part 1)

If there's one thing that a scientist likes, it's playing with data! To do some quantitative analysis of how publication in chemistry has evolved over the years, I looked at the evolution of JACS articles spanning 5 decades, with statistics on published papers in 1961, 1971, 1981, 1991, 2001 and 2011. So, what can we say?

Average number of authors

JACS authors per paper

The histograms above show the number of authors for a given paper. There is a clear trend towards papers with longer author lists, with the average number of authors increasing from 2.4 (in 1961) to 5.3 (in 2011). Of particular note is the quasi-disappearance of single author papers: while they represented 13.7% of published papers in 1961 (188 out of 1364), there were only 10 in 2011 (out of 3176 papers, i.e. 0.3%).

This is in line with the general idea of modern research being more collaborative, which we can try to confirm by looking at the number of different affiliations per paper (only plotted from 1981 to 2011, as earlier data on affiliation is not available):

JACS affiliations per paper

The average number of affiliations increased from 1.3 to 2.4 over this period, somewhat faster than the increase in authorship. I feel, however, that it's important to note there is still a large number of single-affiliation papers, which I did not quite expect (most of my own papers being part of collaborations).

Number of references

Here's the number of references per paper, in 1961, 1991 and 2011:

References per JACS paper

The increase in references (an average of 18.6 per paper in 1961 vs. 49.1 in 2011) is clear, and is (at least partly) responsible for the background inflation of impact factors, for example. However, it should be noted that the increase has been gradual since 1961, so it cannot be driven by the more recent focus on bibliometrics.

That's it for today, my time is up (rugby's over), but next time I'll focus on the writing of chemistry itself: how the language of titles and abstracts has evolved since 1961… In the meantime, comments on the statistics above or suggestions of additional analyses are very welcome!