I want to carry out an authorship analysis by means of complex networks, so I downloaded data from Scopus as a CSV file. Each node (that is, each author) will be identified by the combination of name and affiliation code, which can be something like "University of London". This way, the results are not biased by different authors who share the same name.

Extracting the author name is easy, but the affiliation is not, because the affiliation strings have no standard structure. They appear in many forms, such as "university of XXX…", "XXX university…", "Department of YYY…", or the acronym of the department, and the address is not always included. In a few cases the affiliation lacks details and is simply "university of XXX". This makes it rather challenging to assign an affiliation code to each affiliation string.

I partially solved the problem using the following approach:
1- Manually define a word bank for each affiliation (street name, city, acronym of the department, etc.).
2- Split each affiliation string into a set of single words.
3- Compare each set of words with the word bank of every affiliation; the likely affiliation is the one whose word bank has the largest intersection with the set.
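For reference, the three steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual script: the word banks, the second affiliation name, and the function names `tokenize` and `assign_affiliation` are hypothetical and made up for the example.

```python
import re

# Step 1: hand-made word bank per affiliation code (illustrative entries only).
WORD_BANKS = {
    "University of London": {"london", "university", "malet", "street"},
    "Example Institute of Technology": {"example", "institute", "technology", "eit"},
}


def tokenize(affiliation):
    """Step 2: split an affiliation string into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", affiliation.lower()))


def assign_affiliation(affiliation, word_banks=WORD_BANKS):
    """Step 3: pick the code whose word bank shares the most words with the string.

    Returns (code, overlap); code is None when no word bank matches at all.
    Ties are resolved in favour of the word bank that was defined first.
    """
    tokens = tokenize(affiliation)
    best_code, best_overlap = None, 0
    for code, bank in word_banks.items():
        overlap = len(tokens & bank)
        if overlap > best_overlap:
            best_code, best_overlap = code, overlap
    return best_code, best_overlap


if __name__ == "__main__":
    raw = "Department of Physics, University of London, Malet Street, London"
    print(assign_affiliation(raw))
```

Note that in this scheme a generic word such as "university" counts exactly as much as a distinctive one such as a street name, and ties are broken only by the order of the word banks.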
Unfortunately, this approach doesn't work as well as expected: in many cases the affiliation code is wrongly assigned, and it requires more manual work than I thought. Is there a better method than the one I adopted?