Accurate bibliographic metadata and Google Scholar

by Martin Monperrus
Disclaimer: the information in this post is based on guesswork and experiments, and the techniques described do not work deterministically. The official guidelines do not seem to reflect what is actually implemented in Google Scholar.

The metadata that Google Scholar's robot extracts automatically (title, authors, etc.) is often incorrect. There is a way to give Google Scholar the correct bibliographic metadata of your publications.

The official guidelines are published at http://scholar.google.com/intl/en/scholar/inclusion.html.

The basic idea is to create one HTML web page per paper, in addition to the PDF itself. This page must include certain HTML metadata tags, as shown below, and link to the PDF of the full version of the paper.

Dimensions of Indexing

There are two dimensions of indexing bibliographic metadata.

First, Google Scholar distinguishes between entries for which it has a full version and those for which it could not find one (metadata-only records). The latter are prefixed by [CITATION] in the search results. Google's own documentation says: "What are the results marked [citation] and why can't I click on them? These are articles which other scholarly articles have referred to, but which we haven't found online. To exclude them from your search results, uncheck the 'include citations' box on the left sidebar." It seems that Google Scholar's process and criteria for indexing metadata-only records differ from those for indexing full papers together with their metadata (where the paper and its metadata live at two different URLs).

Second, as in BibTeX, there are different kinds of entries: journal articles, conference papers, technical reports, etc. Some metadata tags are generic (e.g. "citation_author"), while others are specific to one kind of entry ("citation_journal_title" for journal articles, "citation_technical_report_institution" for technical reports).

Finally, note that, contrary to standard web indexing, where every URL is indexed separately, Google Scholar has the concept of a "work": several URLs and formats can be instances of the same work. For instance, foo.edu/paper-xyz.pdf, bar.edu/paper-xyz-2012.pdf and glop.edu/paper-xyz.html would be recognized as the same work even though they come from different domains (foo.edu and bar.edu), have different file names (paper-xyz-2012.pdf and paper-xyz.pdf), and are in different formats (PDF and HTML). If the content is very similar, Scholar treats them as the same work.

Example

<html>
<head>
<title>Crystal structure of squid rhodopsin</title>
<meta name="citation_title" content="Crystal structure of squid rhodopsin" />
<meta name="citation_publication_date" content="1999" />
<meta name="citation_author" content="Murakami, Midori" />
<meta name="citation_author" content="Kouyama, Tsutomu" />
<meta name="citation_pdf_url" content="crist_struct.pdf" />
</head>
<body>
<h1>Crystal structure of squid rhodopsin</h1>
Abstract:
Proin elit arcu, rutrum commodo, vehicula tempus, commodo a, risus.
Curabitur nec arcu. Donec sollicitudin mi sit amet mauris. Nam elementum
quam ullamcorper ante. Etiam aliquet massa et lorem. Mauris dapibus lacus
auctor risus. Aenean tempor ullamcorper leo. Vivamus sed magna quis ligula
eleifend adipiscing. Duis orci. Aliquam sodales tortor vitae ipsum.
</body>
</html>
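
When there are many papers, the head of a page like the one above can be generated instead of written by hand. The following is a minimal Python sketch, not an official tool: the function name citation_meta_tags and its parameters are hypothetical, and it only emits the tags used in the example. Whether Google Scholar actually picks them up remains subject to the disclaimer at the top of this post.

```python
from html import escape


def citation_meta_tags(title, authors, publication_date, pdf_url):
    """Render the <meta> tags of the example page for one paper.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    fields = [
        ("citation_title", title),
        ("citation_publication_date", publication_date),
    ]
    # One citation_author tag per author, in author order.
    fields += [("citation_author", author) for author in authors]
    fields.append(("citation_pdf_url", pdf_url))
    # Escape attribute values so quotes or ampersands cannot break the HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}" />'
        for name, value in fields
    )


if __name__ == "__main__":
    print(citation_meta_tags(
        "Crystal structure of squid rhodopsin",
        ["Murakami, Midori", "Kouyama, Tsutomu"],
        "1999",
        "crist_struct.pdf",
    ))
```

The output can be pasted into the head element of the per-paper HTML page.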

Official metatags extracted from the guidelines:


| Meta tag name | Meaning |
| --- | --- |
| citation_title | The paper title |
| citation_publication_date | The official publication date |
| citation_online_date | The online publication date |
| citation_author (multiple allowed) | An author name. Multiple occurrences of this tag are allowed (see example above). |
| citation_pdf_url | The full paper URL (see FAQ) |
| citation_conference_title | The conference name or the proceedings title (for conference and workshop papers) |
| citation_journal_title | The journal name (for journal papers) |
| citation_volume | The volume (for journal papers) |
| citation_issue | The issue number (for journal papers) |
| citation_issn | The journal ISSN (for journal papers) |
| citation_isbn | ISBN number |
| citation_firstpage | The first page of the article |
| citation_lastpage | The last page of the article |
| citation_dissertation_institution | The university name (for master's and Ph.D. theses) |
| citation_technical_report_institution | The institution name (for technical reports) |
| citation_technical_report_number | The technical report number (for technical reports) |
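
The split between generic and kind-specific tags in the list above can be written down as a small lookup. This is only a sketch: the kind labels ("journal", "conference", "thesis", "techreport") and the function names are assumptions made for illustration. Treating citation_isbn, citation_firstpage and citation_lastpage as generic simply reflects that the list does not tie them to one kind of entry.

```python
# Tags that the list above does not restrict to one kind of entry.
GENERIC_TAGS = [
    "citation_title",
    "citation_publication_date",
    "citation_online_date",
    "citation_author",
    "citation_pdf_url",
    "citation_isbn",
    "citation_firstpage",
    "citation_lastpage",
]

# Kind labels are hypothetical; the tag names come from the list above.
KIND_SPECIFIC_TAGS = {
    "journal": [
        "citation_journal_title",
        "citation_volume",
        "citation_issue",
        "citation_issn",
    ],
    "conference": ["citation_conference_title"],
    "thesis": ["citation_dissertation_institution"],
    "techreport": [
        "citation_technical_report_institution",
        "citation_technical_report_number",
    ],
}


def allowed_tags(kind):
    """Return the generic tags plus the tags specific to this kind."""
    return GENERIC_TAGS + KIND_SPECIFIC_TAGS[kind]


def misplaced_tags(kind, tag_names):
    """Return the tags in tag_names that belong to a different kind."""
    other_kinds = {
        tag
        for other, tags in KIND_SPECIFIC_TAGS.items()
        if other != kind
        for tag in tags
    }
    return [name for name in tag_names if name in other_kinds]
```

For example, misplaced_tags("techreport", ["citation_title", "citation_journal_title"]) flags citation_journal_title, because that tag is meant for journal papers.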

FAQ


Is the abstract mandatory? No. I've seen many pages that are indexed with no abstracts.

What is the best way to specify the publication date? I don't know. The official guidelines say "citation_publication_date", most publishers use "citation_date", and some use "citation_year".

What is the format of "citation_pdf_url"? I don't know. According to my experiments and the survey, the URL apparently must end with ".pdf", and a relative URL containing only the file name seems to help (this may be related to the security remark in the official guidelines). Note that citation_pdf_url does not necessarily have to point to a PDF, according to https://github.com/DSpace/DSpace/pull/379: "The mimetype of the bitstream is not taken into account, as per Google Scholar feedback that this is no longer as important to them."

Is "citation_pdf_url" mandatory? It is not clear. According to Sandsfish's comment, the answer is yes:
"Google Scholar has said they are not interested in having citation tags for an item if [citation_pdf_url] field is not provided for."

Can I use "citation_authors" (plural)? Avoid it. The tag "citation_authors" (mentioned in an email from Google) is deprecated (information given by Google Scholar Support).

What about the other citation tags? There are many of them in the wild. The tags that aren't listed in the guidelines are not officially supported, so their effect on indexing in Google Scholar should be considered undefined (information given by Google Scholar Support). This applies to: citation_date, citation_conference, citation_doi, citation_abstract_html_url, citation_fulltext_html_url, citation_publisher, citation_language, citation_pmid, citation_keywords, citation_dissertation_name, citation_patent_number, citation_patent_country.

What are the most used citation tags? See https://github.com/hubgit/cite-urls for the result of an interesting study.
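
Several of the answers above can be turned into a quick check of an existing page. The sketch below uses only the Python standard library; the names CitationTagCollector, pdf_url_warnings and audit_page are hypothetical. It encodes the guesses from this FAQ (a citation_pdf_url that ends with ".pdf" and is a bare relative file name) rather than any documented Google Scholar rule, and it reports tags that are not listed in the official guidelines.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag names listed in the official guidelines (copied from this post).
OFFICIAL_TAGS = {
    "citation_title", "citation_publication_date", "citation_online_date",
    "citation_author", "citation_pdf_url", "citation_conference_title",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_issn", "citation_isbn", "citation_firstpage",
    "citation_lastpage", "citation_dissertation_institution",
    "citation_technical_report_institution",
    "citation_technical_report_number",
}


class CitationTagCollector(HTMLParser):
    """Collect (name, content) pairs from <meta name="citation_..."> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.tags.append((name, attributes.get("content") or ""))


def pdf_url_warnings(value):
    """Check a citation_pdf_url value against the guesses in this FAQ."""
    warnings = []
    parsed = urlparse(value)
    if not parsed.path.lower().endswith(".pdf"):
        warnings.append("does not end with .pdf")
    if parsed.scheme or parsed.netloc or "/" in parsed.path:
        warnings.append("is not a bare relative file name")
    return warnings


def audit_page(html_text):
    """Summarize the citation tags found in one HTML page."""
    collector = CitationTagCollector()
    collector.feed(html_text)
    collector.close()
    names = {name for name, _ in collector.tags}
    pdf_urls = [c for name, c in collector.tags if name == "citation_pdf_url"]
    return {
        # Tags whose effect on indexing should be considered undefined.
        "unlisted": sorted(names - OFFICIAL_TAGS),
        "has_pdf_url": bool(pdf_urls),
        "pdf_url_warnings": [w for u in pdf_urls for w in pdf_url_warnings(u)],
    }
```

A warning from this check does not mean the page will be rejected; it only means the page departs from what seemed to work in my experiments.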

References



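Survey

| Tag | Nature [1] | OUP [2] | Sage [3] | HighWire [4] | Science [5] | BMC [6] | Repec [7] | Dspace [8] | Cabdirect [9] | Inist [10] | Inist [11] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| citation_title | x | x | x | x | x | x | x | x | x | x | x |
| citation_authors | x |  |  |  |  |  | x |  |  | x | x |
| citation_author |  | x | x | x | x | x |  |  | x |  |  |
| citation_author_institution |  |  |  |  | x |  |  |  |  |  |  |
| citation_date | x | x | x | x | x | x |  | x | x | x |  |
| citation_year |  |  |  |  |  |  | x |  |  |  | x |
| citation_publication_date |  |  |  |  |  |  | x |  |  |  |  |
| citation_journal_title | x | x | x | x | x | x | x |  | x |  | x |
| citation_volume | x | x | x | x | x | x |  |  | x |  | x |
| citation_issue | x | x | x | x | x | x |  |  | x |  | x |
| citation_firstpage | x | x | x | x | x |  |  |  |  |  | x |
| citation_lastpage |  | x | x | x | x |  |  |  |  |  | x |
| citation_doi | x | x | x |  |  | x |  |  |  |  |  |
| citation_publisher | x |  |  |  |  | x | x | x |  | x | x |
| citation_abstract_html_url |  | x | x | x | x | x | x | x |  | x |  |
| citation_abstract_pdf_url |  |  |  |  |  |  | x |  |  | x |  |
| citation_pdf_url |  | x | x | x | x | x |  | x |  |  |  |
| citation_fulltext_html_url |  |  |  | x |  | x |  |  |  |  |  |
| citation_fulltext_world_readable | x |  |  |  |  |  |  |  |  |  |  |
| citation_journal_abbrev |  | x | x | x | x |  |  |  |  |  |  |
| citation_issn |  | x | x | x | x | x |  |  | x |  | x |
| citation_isbn |  |  |  |  |  |  |  |  |  |  | x |
| citation_id |  | x | x | x | x |  |  |  |  |  | x |
| citation_id_from_sass_path |  | x | x | x | x |  |  |  |  |  |  |
| citation_mjid |  | x | x | x | x |  |  |  |  |  |  |
| citation_pmid |  |  |  | x | x | x |  |  |  |  |  |
| citation_public_url |  | x | x | x | x |  |  |  |  |  |  |
| citation_section |  | x | x | x | x |  |  |  |  |  |  |
| citation_language |  |  |  |  |  |  |  |  | x | x | x |
| citation_keywords |  |  |  |  |  |  | x | x | x | x | x |
| citation_abstract |  |  |  |  |  |  | x |  |  |  |  |
| citation_conference |  |  |  |  |  |  |  |  |  |  | x |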

[1] http://www.nature.com/nature/journal/v362/n6423/abs/362801a0.html
[2] http://restud.oxfordjournals.org/content/58/2/277.short
[3] http://apm.sagepub.com/content/1/3/385.short
[4] http://nutrition.highwire.org/content/132/12/3577.short
[5] http://www.sciencemag.org/content/239/4839/487.short
[6] http://www.biomedcentral.com/1465-6906/5/R80 (redirects to http://genomebiology.com/content/5/10/R80)
[7] http://ideas.repec.org/c/boc/bocode/s432001.html
[8] http://dspace.mit.edu/handle/1721.1/84478
[9] http://www.cabdirect.org/abstracts/20113106615.html
[10] http://documents.irevues.inist.fr/handle/2042/52565
[11] http://cat.inist.fr/?aModele=afficheN&cpsidt=27176067
[12] http://www.computer.org/csdl/proceedings/icpr/2004/2128/03/212830590-abs.html