JSON Objects that contain people profile information collected from xing.com

MilkaLichtblau/xing_dataset

Dataset XING57/2017

This dataset contains anonymized user profiles collected from xing.com in response to 57 queries. It was used in Jan-Feb 2017 to study gender biases in the returned ranked search results given each user's profile details.

Version

//\ date and commit

Data format

The results can be found in ~/data as JSON files. Each file contains information on the first 40 profiles, as shown on the first two result pages of the respective query. The information we processed was:

  • duration of job experiences
  • duration of education
  • sex

File Names

SHAano ## start#-end#

  1. marker for the anonymized dataset (names, hyperlinks and pictures belonging to a profile have been removed or replaced with a hash value)
  2. number of the query, in the order of the list below
  3. range of result numbers, e.g. 1001-1040
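
The naming convention above can be turned into a small parser. This is only a sketch: the exact separators and the `.json` extension are assumptions (the README gives just the pattern `SHAano ## start#-end#`), and `ResultFile` and `parse_result_filename` are hypothetical names that do not come from the repository.

```python
import re
from typing import NamedTuple, Optional


class ResultFile(NamedTuple):
    """Parts of a result file name (hypothetical helper type)."""
    query_number: int   # position of the query in the list of queries
    first_result: int   # first result number, e.g. 1001
    last_result: int    # last result number, e.g. 1040


# Assumption: the three parts are separated by spaces, underscores or hyphens,
# and the file may or may not carry a ".json" extension.
_PATTERN = re.compile(
    r"^SHAano[ _-]*(\d+)[ _-]+(\d+)-(\d+)(?:\.json)?$", re.IGNORECASE
)


def parse_result_filename(name: str) -> Optional[ResultFile]:
    """Return the parsed parts, or None if the name does not fit the pattern."""
    match = _PATTERN.match(name.strip())
    if match is None:
        return None
    query, start, end = (int(group) for group in match.groups())
    return ResultFile(query, start, end)


if __name__ == "__main__":
    print(parse_result_filename("SHAano 01 1001-1040.json"))
```

If the real file names use a different layout, only the regular expression should need adjusting.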

Sample profile Structure

{
  "category":
  "dominantSexXing":
  "profiles": [
    {
      "profile": [
        {
          "sex":
          "memberSince_Hits":
          "currJobDescr":
          "jobs": [
            {
              "jobTitle":
              "company":
              "company_url":
              "jobDuration":
              "jobDates":
            },
          ]
        }
      ],
      "languages": [
        {
        }
      ],
      "education": [
        {
          "institution":
          "url":
          "degree":
          "eduDuration":
        }
      ],
      "awards": [
        {
        }
      ]
    }
  ]
}
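
As an illustration of how this nesting can be traversed, here is a minimal sketch that flattens one result object into one row per person. The sample values are invented placeholders (the README does not show real field values), `flatten_profiles` is a hypothetical helper, and treating the list position as the search rank is an assumption.

```python
import json
from typing import Any, Dict, List


def flatten_profiles(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one result object into a list of flat rows, one per person."""
    rows = []
    # Assumption: the order of "profiles" mirrors the order of the search results.
    for rank, entry in enumerate(result.get("profiles", []), start=1):
        for person in entry.get("profile", []):
            rows.append({
                "category": result.get("category"),
                "rank": rank,
                "sex": person.get("sex"),
                "n_jobs": len(person.get("jobs", [])),
                "n_education": len(entry.get("education", [])),
            })
    return rows


# Invented sample that follows the structure above; all values are placeholders.
SAMPLE = json.loads("""
{
  "category": "example query",
  "dominantSexXing": "f",
  "profiles": [
    {
      "profile": [{"sex": "f", "jobs": [{"jobTitle": "a", "jobDuration": "1"}]}],
      "languages": [],
      "education": [{"degree": "x"}, {"degree": "y"}],
      "awards": []
    },
    {
      "profile": [{"sex": "m", "jobs": []}],
      "languages": [],
      "education": [],
      "awards": []
    }
  ]
}
""")

if __name__ == "__main__":
    for row in flatten_profiles(SAMPLE):
        print(row)
```

Using `.get()` with defaults keeps the sketch tolerant of profiles where a section such as `education` is missing.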

Queries

The following queries were chosen with reference to these statistics 1, 2, targeting a diversified collection of specific job titles in the respective career field while excluding jobs underrepresented on XING, such as construction worker or farmer. The order of the queries corresponds to the query number in the file naming convention.

  1. Administrative Assistant
  2. Auditing Clerk
  3. Auditor
  4. accountant
  5. bank teller
  6. treasurer
  7. actuary
  8. budget analyst
  9. economist
  10. mathematician
  11. statistician
  12. Events Coordinator
  13. Office Manager
  14. Secretary
  15. Dental Assistant
  16. Medical Assistant
  17. Receptionist
  18. Audiologist
  19. Daycare
  20. lawyer
  21. legal advisor

  1. Application Developer
  2. Building Inspector
  3. Application Support Analyst
  4. Civil Engineer
  5. Back end Developer
  6. Chemical Engineer
  7. Construction Engineer
  8. Data Analyst
  9. Contract Administrator
  10. Database Administrator
  11. Field Engineer
  12. Front End Developer
  13. Mechanical Engineer
  14. Safety Manager
  15. Software Engineer
  16. Superintendent
  17. System Administrator

  1. Technical Support Specialist
  2. Account Coordinator
  3. Account Executive
  4. Advertising Director
  5. Art Director
  6. Brand Assistant*
  7. Brand Manager*
  8. Brand Strategist*
  9. Copywriter
  10. creative director
  11. Internet Marketing Coordinator
  12. Market Research Analyst
  13. Marketing Associate
  14. Online Product Manager
  15. Public Relations Representative
  16. Public Relations Specialist
  17. SEO Manager
  18. Social Media Marketing Coordinator
  19. Architect

* Please note that Brand is also a family name in Germany.

About The results

  • Searches were performed in English without logging in, to ensure that the ordering of the results is not tailored to a specific profile. After generating the results, each profile was parsed in full detail (while logged in).
  • The sex of a person was manually derived from the profile name and picture since it is not given on the profile. This helped us filter irrelevant information such as fake profiles or profiles with misleading information (e.g. containing details about a company instead of a person).
  • 19 queries returned duplicate entries. In most cases the duplicates appeared one position apart. In such cases the later entry was removed, so a few result files contain fewer than 40 profiles.
  • Details such as company or institution name were anonymized using SHA-256, so that it is only possible to differentiate between people who worked or studied at the same place, or to find other patterns.
  • currJob is always equal to the first element in pastJobs
  • If a profile was found to be employed or studying at the time the data was collected, we replaced the date.
  • Profiles with incomplete data, in particular with missing dates, were handled as follows: //\ add it to code instead?
      ◦ If a job or education entry has no name, it counts for an average of 3 months.
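
Two of the preprocessing steps described in this list, hashing names with SHA-256 and dropping the later copy of a duplicate, can be sketched as follows. This is not the authors' code: the function names are hypothetical, and the sketch assumes a plain, unsalted SHA-256 over the UTF-8 encoded string, which the README does not confirm.

```python
import hashlib
from typing import Hashable, Iterable, List


def anonymize(value: str) -> str:
    """Replace a name with its SHA-256 hex digest.

    Equal inputs give equal digests, so people who share a company or
    institution can still be grouped without the name being readable.
    Assumption: no salt and no normalization are applied.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def drop_later_duplicates(items: Iterable[Hashable]) -> List[Hashable]:
    """Keep the first occurrence of each item and drop later repeats."""
    seen = set()
    kept = []
    for item in items:
        if item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


if __name__ == "__main__":
    # Placeholder identifiers; a real run would use whatever identifies a profile.
    ranking = ["p1", "p2", "p2", "p3"]
    print(drop_later_duplicates(ranking))
    print(anonymize("Example GmbH") == anonymize("Example GmbH"))
```

Removing the later copy preserves the rank of the first occurrence, which is why a deduplicated result list can end up shorter than 40 entries.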

Code

The code in src/ reads the information from all JSON files into a Python dataframe that can be used later on. Currently, the dataframe is simply dumped to disk. To use the code, you can execute these commands:

//\

Citation

If you use this dataset, please cite:

Zehlike, Meike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. "FA*IR: A fair top-k ranking algorithm." In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569-1578. ACM, 2017.

BibTeX Entry:

@inproceedings{zehlike2017fair,
  title={{FA*IR}: A fair top-k ranking algorithm},
  author={Zehlike, Meike and Bonchi, Francesco and Castillo, Carlos and Hajian, Sara and Megahed, Mohamed and Baeza-Yates, Ricardo},
  booktitle={Proceedings of the 2017 ACM on Conference on Information and Knowledge Management},
  pages={1569--1578},
  year={2017},
  organization={ACM}
}

The authors are not associated with XING in any way.
