The metadata (title, authors, etc.) that Google Scholar's robot extracts automatically is often incorrect. There is a way to give Google Scholar the correct bibliographic metadata for your publications.
The basic idea is to create one HTML web page per paper, in addition to the PDF URL. This page must include certain HTML metadata tags, as shown below, and link to the PDF of the full version of the paper (Official doc: http://scholar.google.com/intl/en/scholar/inclusion.html).
Disclaimer: the information in this post is based on guesswork, and the approach does not work deterministically. The official guidelines do not seem to reflect what is actually implemented in Google Scholar.
Recommendation: instead of doing this manually, use a bibliographic server. I recommend arXiv, Zenodo, or one of the publication servers in this list: https://www.base-search.net/about/en/about_countries_land_up.php
Dimensions of Indexing
There are two dimensions of indexing bibliographic metadata. First, Google Scholar distinguishes between entries for which it has a full version and those for which it was not able to find one (metadata-only records). The latter are prefixed by [CITATION] in the search results. Google Scholar's own documentation says: "What are the results marked [citation] and why can't I click on them? These are articles which other scholarly articles have referred to, but which we haven't found online. To exclude them from your search results, uncheck the 'include citations' box on the left sidebar." It seems that Google Scholar's process and criteria for indexing metadata-only records differ from those for indexing full papers and their metadata (which are on two different URLs).
Second, as in BibTeX, there are different kinds of entries: journal articles, conference papers, technical reports, etc. Some metadata tags are generic (e.g. "citation_author"), others are specific to one kind of entry ("citation_journal_title" for journal articles or "citation_technical_report_institution" for tech reports).
Finally, note that, contrary to standard web indexing where every URL is indexed separately, Google Scholar has the concept of a "work": several URLs and formats can be instances of the same work. For instance, foo.edu/paper-xyz.pdf, bar.edu/paper-xyz-2012.pdf and glop.edu/paper-xyz.html would be recognized as the same work even though they come from different domains (foo.edu, bar.edu and glop.edu), have different file names (paper-xyz.pdf and paper-xyz-2012.pdf), and are in different formats (PDF and HTML). If the content is very similar, Scholar treats them as the same work.
Example
See also the example in the official guidelines.
<html>
<head>
<meta name="citation_title" content="Crystal structure of squid rhodopsin" />
<meta name="citation_author" content="Murakami, Midori" />
<meta name="citation_author" content="Kouyama, Tsutomu" />
<meta name="citation_publication_date" content="1999" />
<meta name="citation_pdf_url" content="crist_struct.pdf" />
</head>
<body>
<h1>Crystal structure of squid rhodopsin</h1>
Abstract:
Proin elit arcu, rutrum commodo, vehicula tempus, commodo a, risus.
Curabitur nec arcu. Donec sollicitudin mi sit amet mauris. Nam elementum
quam ullamcorper ante. Etiam aliquet massa et lorem. Mauris dapibus lacus
auctor risus. Aenean tempor ullamcorper leo. Vivamus sed magna quis ligula
eleifend adipiscing. Duis orci. Aliquam sodales tortor vitae ipsum.
</body>
</html>
Official meta tags extracted from the guidelines:
Meta tag name | Meaning |
---|---|
citation_title | The paper title |
citation_publication_date | The official publication date in the "2010/5/12" or year alone "2010" format |
citation_online_date | The online publication date in the "2010/5/12" or year alone "2010" format |
citation_author (multiple allowed) | An author name. Multiple occurrences of this tag are allowed (see example above). |
citation_pdf_url | The full paper URL (see FAQ) |
citation_conference_title | The conference name or the proceedings title (for conference and workshop papers) |
citation_journal_title | The journal name (for journal papers) |
citation_volume | The volume (for journal papers) |
citation_issue | The issue number (for journal papers) |
citation_issn | The journal ISSN (for journal papers) |
citation_isbn | ISBN number |
citation_firstpage | The first page of the article |
citation_lastpage | The last page of the article |
citation_dissertation_institution | The university name (for master's and Ph.D. theses) |
citation_technical_report_institution | The institution name (for technical reports) |
citation_technical_report_number | The technical report number (for technical reports) |
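The tags in the table can also be generated from a simple record instead of being typed by hand. The following is only a minimal sketch in Python: the function name `render_citation_meta` and the sample paper are hypothetical, and nothing here is guaranteed to change how Google Scholar indexes a page. A list value produces one tag per item, which is how repeated "citation_author" tags are expressed.

```python
from html import escape


def render_citation_meta(record):
    """Return <meta> tag lines for a dict of citation_* fields.

    A list value yields one tag per item (e.g. several citation_author tags).
    """
    lines = []
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            lines.append(
                '<meta name="%s" content="%s" />'
                % (escape(name, quote=True), escape(str(item), quote=True))
            )
    return "\n".join(lines)


# Hypothetical journal article; every value below is made up for illustration.
paper = {
    "citation_title": "An Example Paper",
    "citation_author": ["Doe, Jane", "Roe, Richard"],
    "citation_publication_date": "2010/5/12",
    "citation_journal_title": "Journal of Examples",
    "citation_volume": "12",
    "citation_issue": "3",
    "citation_firstpage": "101",
    "citation_lastpage": "110",
    "citation_pdf_url": "example-paper.pdf",
}

print(render_citation_meta(paper))
```

The output is meant to be pasted into the `<head>` of the paper's HTML page. Values are HTML-escaped so that quotes or ampersands in a title do not break the tag.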
FAQ
Is the abstract mandatory? No. I've seen many pages that are indexed with no abstracts.
What is the best way to specify the publication date? I don't know. The official guidelines say "citation_publication_date", most publishers use "citation_date", and some "citation_year".
What is the format of "citation_pdf_url"? I don't know. According to my experiments and the survey, it seems that it must end with ".pdf", and that a relative URL consisting only of the file name helps (this seems related to the security mention in the official guidelines). Note, however, that citation_pdf_url does not necessarily have to point to a PDF, according to https://github.com/DSpace/DSpace/pull/379: "The mimetype of the bitstream is not taken into account, as per Google Scholar feedback that this is no longer as important to them."
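The two observations above can be turned into a quick sanity check. This is a sketch of my own guesses, not of any rule published by Google Scholar: the function name `pdf_url_warnings` and both checks are assumptions, and a URL that triggers a warning may still be indexed fine.

```python
from urllib.parse import urlsplit


def pdf_url_warnings(url):
    """List the ways a citation_pdf_url deviates from the guessed-at pattern."""
    warnings = []
    parts = urlsplit(url)
    # Observation 1 (a guess): the URL path seems to need a ".pdf" ending.
    if not parts.path.lower().endswith(".pdf"):
        warnings.append("path does not end with .pdf")
    # Observation 2 (a guess): a bare relative file name seems to help.
    if parts.scheme or parts.netloc or "/" in parts.path:
        warnings.append("not a bare relative file name")
    return warnings


print(pdf_url_warnings("crist_struct.pdf"))
print(pdf_url_warnings("http://example.org/paper?id=3"))
```

The first call uses the file name from the example page and returns an empty list. The second, with a made-up URL, triggers both warnings.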
Is "citation_pdf_url" mandatory? It is not clear. According to Sandsfish's comment, the answer is yes:
"Google Scholar has said they are not interested in having citation tags for an item if [citation_pdf_url] field is not provided for."
Can I use "citation_authors" (plural)? Avoid it. The "citation_authors" tag (mentioned in this email from Google) is deprecated (information given by Google Scholar Support).
What about the other citation tags? There are many of them in the wild. The tags that aren't listed in the guidelines are not officially supported, so their effect on indexing in Google Scholar should be considered undefined (information given by Google Scholar Support): citation_date, citation_conference, citation_doi, citation_abstract_html_url, citation_fulltext_html_url, citation_publisher, citation_language, citation_pmid, citation_keywords, citation_dissertation_name, citation_patent_number, citation_patent_country.
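To see which tags on an existing page fall outside the table above, a small scanner is enough. The sketch below uses Python's standard `html.parser`; the names `SUPPORTED`, `CitationTagCollector` and `unlisted_citation_tags` are hypothetical, and the `SUPPORTED` set is copied from the table in this post rather than from any machine-readable source.

```python
from html.parser import HTMLParser

# Tag names copied from the table of official meta tags in this post.
SUPPORTED = {
    "citation_title", "citation_publication_date", "citation_online_date",
    "citation_author", "citation_pdf_url", "citation_conference_title",
    "citation_journal_title", "citation_volume", "citation_issue",
    "citation_issn", "citation_isbn", "citation_firstpage",
    "citation_lastpage", "citation_dissertation_institution",
    "citation_technical_report_institution",
    "citation_technical_report_number",
}


class CitationTagCollector(HTMLParser):
    """Collect the name attribute of every <meta name="citation_..."> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        name = dict(attrs).get("name") or ""
        if name.startswith("citation_"):
            self.names.append(name)


def unlisted_citation_tags(html_text):
    """Return the sorted citation_* tag names that are not in SUPPORTED."""
    collector = CitationTagCollector()
    collector.feed(html_text)
    return sorted(set(collector.names) - SUPPORTED)


# Made-up page head mixing listed and unlisted tags.
page = (
    '<head><meta name="citation_title" content="T" />'
    '<meta name="citation_date" content="2010" />'
    '<meta name="citation_authors" content="A; B">'
    '<meta name="description" content="x"></head>'
)
print(unlisted_citation_tags(page))
```

A non-empty result does not mean the page is wrong, only that the effect of those tags should be considered undefined.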
What are the most used citation tags? See https://github.com/hubgit/cite-urls for the result of an interesting study.
References
- bibtexbrowser automatically generates this metadata from BibTeX files.
- this post at crossref.org
- Publish or Perish information page
- this post at reallywow.com
Survey
[1] http://www.nature.com/nature/journal/v362/n6423/abs/362801a0.html
[2] http://restud.oxfordjournals.org/content/58/2/277.short
[3] http://apm.sagepub.com/content/1/3/385.short
[4] http://nutrition.highwire.org/content/132/12/3577.short
[5] http://www.sciencemag.org/content/239/4839/487.short
[6] http://www.biomedcentral.com/1465-6906/5/R80 (redirects to http://genomebiology.com/content/5/10/R80)
[7] http://ideas.repec.org/c/boc/bocode/s432001.html
[8] http://dspace.mit.edu/handle/1721.1/84478
[9] http://www.cabdirect.org/abstracts/20113106615.html
[10] http://documents.irevues.inist.fr/handle/2042/52565
[11] http://cat.inist.fr/?aModele=afficheN&cpsidt=27176067
[12] http://www.computer.org/csdl/proceedings/icpr/2004/2128/03/212830590-abs.html