How it works

After a faculty member chooses to participate as described above, their CV goes through a multi-step process that transforms its PDF content into data within ScholarConnect:

  1. Extract PDF content: text, formatting, page number, and position information are extracted from the PDF.
  2. Extract sample headings: the first few section headings are identified using GPT-4o vision. Only the first 1-2 pages of the PDF are analyzed.
  3. Visual analysis: the formatting, position, and other styling of the sample headings are analyzed to find the visual characteristics they share.
  4. Identify all headings: every heading in the CV is identified using the visual characteristics found in the previous step.
  5. Identify publications section: GPT-4o reviews the list of identified headings and helps pick out the main headings of the CV. Importantly, it identifies the publications section, including any books, conference proceedings, journal articles, etc.
  6. Extract bibliographic data: the text of the publications section is provided to GPT-4o, which transforms the unstructured text into structured data (JSON). The text is "chunked" and processed iteratively to stay within GPT-4o's output token limits.
  7. Extract other data: GPT-4o is given the entire CV and extracts data such as the individual's name, current positions, degrees, distinctions, contact info, and professional society affiliations.
  8. Identify areas of expertise: GPT-4o is used again to identify 10 areas of expertise.
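
As a rough illustration of steps 3, 4, and 6, the sketch below shows one way heading detection by shared styling and text chunking could work. It is not the ScholarConnect implementation: the `Line` record, the choice of font size and bold as the "visual characteristics", the function names, and the character-based chunk limit are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """One extracted line of PDF text with its styling (hypothetical shape)."""
    text: str
    font_size: float
    bold: bool
    page: int


def common_style(samples: list[Line]) -> tuple[float, bool]:
    """Step 3 (assumed): the (font size, bold) pair most common among sample headings."""
    counts = Counter((s.font_size, s.bold) for s in samples)
    return counts.most_common(1)[0][0]


def find_headings(lines: list[Line], style: tuple[float, bool]) -> list[Line]:
    """Step 4 (assumed): keep every line whose styling matches the sample headings."""
    return [ln for ln in lines if (ln.font_size, ln.bold) == style]


def chunk_text(entries: list[str], max_chars: int) -> list[str]:
    """Step 6 (assumed): group entries into newline-joined chunks of at most max_chars.

    An entry longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for entry in entries:
        added = len(entry) + (1 if current else 0)  # +1 for the joining newline
        if current and size + added > max_chars:
            # The current chunk is full: emit it and start a new one.
            chunks.append("\n".join(current))
            current, size = [], 0
            added = len(entry)
        current.append(entry)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

In a pipeline like the one described, each chunk would be sent to the model separately and the structured results combined afterwards, so that no single response has to exceed the output limit.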

The result is a large structured data file (JSON) representing most of the CV content. This data is then published to scholarconnect.uci.edu, and we email each faculty member a link to their profile for review. We hope to further refine these steps with faculty input so that little editing is required.
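
To give a sense of what such a structured data file might look like, here is a minimal sketch. The field names and nesting are assumptions based only on the kinds of data listed in the steps above. They are not the actual ScholarConnect schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Publication:
    """One bibliographic entry (hypothetical fields)."""
    title: str
    authors: list[str] = field(default_factory=list)
    year: Optional[int] = None
    venue: Optional[str] = None


@dataclass
class Profile:
    """Structured view of a CV (hypothetical fields)."""
    name: str
    positions: list[str] = field(default_factory=list)
    degrees: list[str] = field(default_factory=list)
    distinctions: list[str] = field(default_factory=list)
    society_affiliations: list[str] = field(default_factory=list)
    publications: list[Publication] = field(default_factory=list)
    areas_of_expertise: list[str] = field(default_factory=list)


def to_json(profile: Profile) -> str:
    """Serialize a profile, including nested publications, to a JSON string."""
    return json.dumps(asdict(profile), indent=2)
```

A reviewer-facing profile page could then be rendered from this one JSON document, which is also where a faculty member's corrections would be applied.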

The Team

ScholarConnect is led by Institutional Research, Assessment, and Planning, in close partnership with the Offices of Academic Personnel and Information Technology.

Project Champion & Sponsor

Roxane Silver

Vice Provost for Institutional Research, Assessment, and Planning

Pilot School Sponsor

James Bullock

Dean, School of Physical Sciences

Executive Sponsors

Tom Andriola

Vice Chancellor, Data and Information Technology

Kian Colestock

AVC and CIO, Office of Information Technology

Project Leadership

Andrea Bell

Strategic Program Director, Institutional Research, Assessment, and Planning

Max Garrick

Acting Director, Student & Academic Services

Software Engineering

Tarique Shams

AI Software Engineer, Chancellor & Provost IT, Office of Information Technology