This tutorial was partially adapted from http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial, where you can learn more about Open Refine. It used to be called Google Refine so try that too when you are searching for information. However, in our case, we use python to do the same thing.
Downloading Data
The university data can be downloaed from http://enipedia.tudelft.nl/enipedia/images/f/ff/UniversityData.zip
What you can learn
The data contains quite a few issues, and this tutorial shows how to do things like:
- Cleaning up inconsistent spelling of terms (i.e. “USA”, “U.S.A”, “U.S.”, etc).
- Converting values that are text descriptions of numeric values (i.e. $123 million) to actual numeric values (i.e. 123000000) which are usable for analysis.
- Identifying which rows of a specific column contain a search term
- Extracting and cleaning values for dates
- Removing duplicate rows
- Using a scatterplot to visualize relationships between values in different columns
Finding geographic coordinates for a list of place names (i.e. the names of universities, etc.)- Exporting cleaned data to Excel
Reading Data with Pandas
%matplotlib inline
import sys
import pandas as pd
import statsmodels.api as sm
from collections import Counter, defaultdict
import numpy as np
import datetime
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
df = pd.read_csv('universityData.csv', sep = '\t')
df.head()
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Paris Universitas | 15 | 5500 | 8000 | France | NaN | 2005 | NaN | 25000 | 70000 |
1 | Paris Universitas | 15 | 5500 | 8000 | France | NaN | 2005 | NaN | 25000 | 70000 |
2 | Lumi%C3%A8re University Lyon 2 | 121 | NaN | 1355 | France | NaN | 1835 | 7046 | 14851 | 27393 |
3 | Confederation College | 4700000 | NaN | NaN | Canada | NaN | 1967 | not available | pre-university students; technical | 21160 |
4 | Rocky Mountain College | 16586100 | NaN | NaN | United States | NaN | 1878 | 66 | 878 | 894 |
print df.university[2]
Lumi%C3%A8re University Lyon 2
Apparently, there are duplicate rows.
len(df)
75043
Deduplicate entries 1.0
df = df.drop_duplicates()
len(df)
15534
We already see an issue here where there is both the full name of a country (United States) and its abbreviation (US). To fix this, we can just copy/paste “United States” as the new cell value.
Clean up country names
df['country'].unique()
array(['France', 'Canada', 'United States', 'USA', 'Italy', 'South Korea',
'Japan', 'United States of America', 'England, UK',
'United States )', 'Saudi Arabia', 'Honduras', 'United Kingdom',
'England', 'the Netherlands', 'India', 'Russia', 'U.S.', 'Brazil',
'US', 'U.S.A.', 'Philippines', 'Australia',
'England, United Kingdom', 'Wales', ',', 'China', 'South Africa',
'UK', 'Puerto Rico', 'Bulgaria', 'Botswana', 'Taiwan', 'Sri Lanka',
'Colombia', 'Iran', 'Russian Federation', 'Rossija', 'Finland',
'Republic of China', 'Chile', 'Romania', 'Utopia', 'Singapore',
'Lebanon', 'Turkey', 'Canada B1P 6L2', 'Jordan', 'Albania',
'Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795',
'Scotland', 'Pakistan', 'Scotland, United Kingdom', 'Denmark',
'Mexico', 'Thailand', 'Argentina', 'Cura%C3%A7ao', 'Sweden',
'Scotland, UK', 'Bangladesh', 'Nepal', 'Netherlands', 'Switzerland',
'Egypt', 'Spain', 'Satellite locations:',
'Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL'], dtype=object)
country_df = df.groupby('country').size()
for k in country_df.index:
print(k, country_df[k])
(',', 2)
('Albania', 8)
('Argentina', 1)
('Australia', 49)
('Bangladesh', 27)
('Botswana', 1)
('Brazil', 10)
('Bulgaria', 2)
('Canada', 625)
('Canada B1P 6L2', 40)
('Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795', 1)
('Chile', 1)
('China', 4)
('Colombia', 1)
('Cura%C3%A7ao', 1)
('Denmark', 8)
('Egypt', 1)
('England', 338)
('England, UK', 286)
('England, United Kingdom', 64)
('Finland', 2)
('France', 8)
('Honduras', 1)
('India', 40)
('Iran', 9)
('Italy', 2)
('Japan', 112)
('Jordan', 2)
('Lebanon', 5)
('Mexico', 2)
('Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL', 1)
('Nepal', 4)
('Netherlands', 2)
('Pakistan', 2)
('Philippines', 147)
('Puerto Rico', 2)
('Republic of China', 1)
('Romania', 1)
('Rossija', 2)
('Russia', 1)
('Russian Federation', 2)
('Satellite locations:', 1)
('Saudi Arabia', 1)
('Scotland', 64)
('Scotland, UK', 16)
('Scotland, United Kingdom', 16)
('Singapore', 4)
('South Africa', 7)
('South Korea', 7)
('Spain', 1)
('Sri Lanka', 4)
('Sweden', 24)
('Switzerland', 8)
('Taiwan', 1)
('Thailand', 4)
('Turkey', 4)
('U.S.', 304)
('U.S.A.', 81)
('UK', 83)
('US', 455)
('USA', 5104)
('United Kingdom', 405)
('United States', 6949)
('United States )', 1)
('United States of America', 167)
('Utopia', 1)
('Wales', 2)
('the Netherlands', 2)
us_condition = df['country'].isin(['U.S.', 'U.S.A.', 'US', 'United States', 'United States )', 'United States of America'])
df['country'][us_condition] = 'USA'
ca_condition =[]
for i in df['country']:
if i.__contains__('Canada'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'Canada'
ca_condition =[]
for i in df['country']:
if i.__contains__('China'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'China'
ca_condition =[]
for i in df['country']:
if i.__contains__('England'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'England'
ca_condition =[]
for i in df['country']:
if i.__contains__('Scotland'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'England'
uk_condition = df['country'].isin(['UK', 'United Kingdom', 'Wales'])
df['country'][uk_condition] = 'England'
ca_condition =[]
for i in df['country']:
if i.__contains__('Netherlands'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'Netherlands'
ca_condition =[]
for i in df['country']:
if i.__contains__('Russia'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'Russia'
ca_condition =[]
for i in df['country']:
if i.__contains__('Bahamas'):
ca_condition.append(True)
else:
ca_condition.append(False)
df['country'][ca_condition] = 'Bahamas'
'Russian Federation'.__contains__('Russia')
True
uk_condition = df['country']=='Cura%C3%A7ao'
df['country'][uk_condition] = "Curacao"
df[df['country']==',']
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
2130 | Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco | 23760.0 | 2188 | NaN | , | 1087 | 1958-11-20 | 971 | 24921 | 35271 |
3077 | Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco | 34140.0 | 2188 | NaN | , | 1087 | 1958-11-20 | 971 | 24921 | 35271 |
Universidad Juárez Autónoma de Tabasco is a public institution of higher learning located in Villahermosa, Tabasco, Mexico.
uk_condition = df['country']==','
df['country'][uk_condition] = "Mexico"
df[df['country']=='Satellite locations:']
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
75009 | Nova Southeastern University | US $64.5 million | 2083 | NaN | Satellite locations: | 4319 | 1964 | 22060 | 6397 | 28457 |
Nova Southeastern University (NSU) is a private nonprofit university, with a main campus located on 300 acres (120 ha) in Davie, in the US state of Florida. Formerly referred to as “Nova” and now commonly called “NSU”, the university currently consists of 18 colleges and schools offering over 175 programs of study with more than 250 majors.
uk_condition = df['country']=='Satellite locations:'
df['country'][uk_condition] = "USA"
country_df = df.groupby('country').size()
for k in country_df.index:
print(k, country_df[k])
('Albania', 8)
('Argentina', 1)
('Australia', 49)
('Bahamas', 1)
('Bangladesh', 27)
('Botswana', 1)
('Brazil', 10)
('Bulgaria', 2)
('Canada', 666)
('Chile', 1)
('China', 5)
('Colombia', 1)
('Curacao', 1)
('Denmark', 8)
('Egypt', 1)
('England', 1274)
('Finland', 2)
('France', 8)
('Honduras', 1)
('India', 40)
('Iran', 9)
('Italy', 2)
('Japan', 112)
('Jordan', 2)
('Lebanon', 5)
('Mexico', 4)
('Nepal', 4)
('Netherlands', 4)
('Pakistan', 2)
('Philippines', 147)
('Puerto Rico', 2)
('Romania', 1)
('Rossija', 2)
('Russia', 3)
('Saudi Arabia', 1)
('Singapore', 4)
('South Africa', 7)
('South Korea', 7)
('Spain', 1)
('Sri Lanka', 4)
('Sweden', 24)
('Switzerland', 8)
('Taiwan', 1)
('Thailand', 4)
('Turkey', 4)
('USA', 13062)
('Utopia', 1)
Clean up values for the number of students
We need to clean the data for the number of students. Not all of the values are numeric, and many of them contain bits of text in addition to the actual number of the students. To figure out which entries need to be fixed, we need to use a Numeric facet:
df.describe()
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
count | 15534 | 15504 | 13541 | 602 | 15534 | 3166 | 15534 | 15183 | 15532 | 13830 |
unique | 1085 | 1172 | 601 | 63 | 47 | 352 | 448 | 887 | 1104 | 698 |
top | Pennsylvania State University | 1.708E9 | 8864 | 60 | USA | 3373 | 1855 | 800 | 77,179 Total | 800 |
freq | 8960 | 4480 | 8960 | 192 | 13062 | 216 | 8970 | 651 | 896 | 564 |
df.numStudents.unique()
array(['70000', '27393', '21160', '894', '15553', '62801', '65234', nan,
'15424', '4533', '5756', '24424', '2426', '7730', '2987', '10477',
'13773', '14388', '14522', '32472', '40829', '17101', '7497',
'2400', '7306', '17500', '4463', '2200', '6448', '4863', '900+',
'13570', '4927', '32739', '1604', '2499', '13785', '16317', '339',
'1874', '10790', '4494', '4836', '4176', '4352', '30819', '6314',
'26851', '3251', '616', '49589', '50116', '3224', '52405',
'~50,000', '7773', '3785', '3822', '5000', '20712', '21000',
'21048', '7787', '17231', '18502', '6158', '11804', '36041',
'38912', '29390', '1207', '3090', '3249', '24370', '2654', '2815',
'15705', '14575', '1144', '10235', '1015', '3657', '1033', '1062',
'9840', '32762', '14806', '10623', '13438', '18900', '23420',
'3192', '2480', '12125', '4800', '30930', '7848', '5560', '12130',
'25063', '25277', '16567', '12270', '2566', '7881', '1972', '1779',
'2715', '4179', '3200', '971', '7538', '10133', '6214', '14196',
'35271', '11034', '17055', '14153', '998', '4238', '9251', '23500',
'4666', '17502', '11722', '47430', '4600', '9000', '7110', '3120',
'20655', '2766', '29290', '3740', '8394', '600', '3927', '1836',
'7313', '4061', '2073', '2183', '1315', '15484', '3345', '735',
'15872', '11308', '15889', '7885', '1066', '1331', '3574', '6840',
'1737', '11604', '6058', '1917', '5079', '23859', '23930', '2110',
'2640', '14000', '2942', '3483', '4854', '5930', '23175', '23855',
'2355', '9625', '4500', '1000', '779', '1760', '1885', '37641',
'3739', '17000', 'http://www.brocku.ca/athletics/quickfacts.php',
'20762', '21080', '757', '1116', '2000', '9748', '14177', '2242',
'2488', '19000', '2305', '16548', '16809', '38140', '13863', '4900',
'10363', '21016', '27008', '18,234 -', '26,101 -', '36000', '4200',
'21091', '110436', '18396', '18448', '9558', '7.07184E7',
'7.71264E7', '10000', '1650', '5097', '5471', '5008', '7542',
'8800', '1791', '23122', '30328', '6080', '2776', '14235', '2636',
'2744', '4250', '8094', '13600', '34767', '3709', '22125', '24875',
'2225', '9203', '27676', '8955', '25215', '29549', '30000', '11659',
'17', '3168', '17849', '22280', '19020', '800', '2725', '32275',
'33825', '4337', '3974', '12114', '23135', '3000', '3485', '29703',
'8831', '30303', '11476', '11842', '602', '8649', '1703', '3500',
'5201', '3719', '42099', '42595', '5760', '6082', '2175', '2231',
'22479', '4183', '47122', '5790', '4014', '28602', '42513', '1118',
'15405', '18442', '5033', '4600800.0', '10534', '28525', '55014',
'7751', '7755', '58698', '7567', '11201', '2100', '24531', '1500',
'1,500+', '46422', '46423', '754', '17290', '39165', '9300',
'41674', '42716', '42761', '42907', '49020', '12445', '19333',
'22405', '24662', '18572', '11593', '3209', '23162', '23590',
'8742', '5552', '5,552 total', '1415', '16822', '7942', '6957',
'11733', '28311', '5200', '13995', '14070', '30823', '31766',
'10688', '12557', '21225', '25313', '12024', '12683', '2124',
'5500', '5514', '5525', '12725', '9517', '35274', '37132', '450',
'23315', '24735', '10502', '20407', '8034', '1210', '46302',
'13959', '24849', '13490', '15265', '8606', '29462', '7615',
'24170', '8603', '8645', '9135', '8878', '9370', '29616', '22707',
'2334', '28018', '33585', '21210', '3439', '19700', '25890',
'26073', '9555', '13351', '13676', '4278', '35200', '14560', '7943',
'8005', '12714', '12002', '23761', '24977', '12400', '82', '3432',
'24594', '2648', '19890', '10820', '6101', '2565', '29894', '29952',
'5260', '22275', '16000', '20222', '21011', '37426', '19391',
'25700', '26960', '5379', '24593', '14765', '2900',
'faculty to student ratio: 12:1', '9150', '1600', '35204', '4000',
'47966', '1683', '~1,610', '363', '44', '32304', '6945', '51611',
'51721', '51853', '30540', '32500', '13183', '9940', '18475',
'38934', '57409', '24225', '22805', '30461', '44595', '49000',
'4991', '9339', '23085', '23336', '17455', '10383', '12000', '811',
'3246', '21827', '2282', '11646', '15642', '8810', '23600', '24192',
'8298', '23470', '13089', '19415', '19740', '41215', '2836', '3700',
'3770', '3829', '3721', '4525', '5365', '3050', '2345', '7700',
'7257', '2263', '19780', '2589', '2542', '29898', '2120', '2300',
'3600', '4384', '1748', '7928', '7005', '7277', '7740', '2270',
'1130.0', '1841.0', '6919', '47878', '34000', '9106', '8278',
'3378', '7131', '19966', '6964', '5809', '19664', '19993', '3420',
'2658', '2609', '1938', '2442', '42606', '2435', '10074', '20212',
'20,000+', '1460', '6276', '1308', '2719', '40000', '11065', '2660',
'3633', '80022011', '20487', '111329', '852', '131403', '24595',
'28', '1371', '3800', '23883', '3137', '33788', '28394', '30377',
'5.29092E7', '5.49E7', '14706', '720', '2674', '2886', '23103',
'8000', '10482', '6', '1890', '1872', '34255', '53337', '27523',
'18498', '11922', '6398', '15189', '70440', '72254', '23588',
'24378', '8985', '3461', '2', '27209', '22974', '11581', '47800',
'47954', '14769', '15195', '1771', '16355',
'One MEELLLLIOONNN DOLL HAIRS', '19500', '17499', '10894', '33500',
'15319', '3480', '28094', '8846', '6726', '27606', '20939', '20330',
'21535', '6647', '1454', '20643', '21329', '13925', '4496', '17261',
'17333', '21559', '630', '795', '6290', '8524', '33977', '44817',
'95833', '630 Dickinson School of Law', '800 College of Medicine',
'8,524 World Campus', '44,817 University Park', '795 Great Valley',
'95,833 Total', '33,977 Commonwealth Campuses',
'6,290 PA College of Tech', '25045', '29887', '31040', '27816',
'87274', '17999', '17950', '4197033329', '41710', '15649', '11956',
'28766', '28823', '2724', '45963', '978', '11180', '12312', '19721',
'5233', '29689', '4072', '11867', '38010', '18630', '27195',
'15657', '14820', '13381', '13893', '3245', '15064', '15473',
'7521', '5913', '5998', '33602', '31899', '39271', '39922', '33490',
'24125', '28994', '28290', '1115', '17351', '2284', '17533',
'18971', '39697', '39726', '27269', '19379', '13410', '2203',
'19728', '3005', '3045', '3540', '2459', '18004', '45126', '45954',
'2559', '15446', '32611', '56868', '28203', '20956', '3537', '6654',
'285392012', '15920', '34870', '8289', '25469', '10549', '18762',
'14713', '28091', '55115', '32653', '5152', '380', '14754', '7764',
'15951', '16040', '15536', '9799', '9352', '28457'], dtype=object)
float('5.29092E7')
52909200.0
df.numStudents = [str(i).replace('+', '').replace('~', '') for i in df.numStudents]
df.numStudents = [str(i).replace(',', '').replace('-', '').strip() for i in df.numStudents]
df.numStudents = [i.split(' ')[0] for i in df.numStudents]
# https://stackoverflow.com/questions/4138202/using-isdigit-for-floats
def isDigit(x):
try:
float(x)
return True
except ValueError:
return False
isDigit('1130.0')
True
str.isdigit('1130.0')
False
df['numStudents'] = df['numStudents'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
odds = {}
for i in df['numStudents']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
len(df)
15534
odds
{}
df.numStudents.iloc[0]
70000.0
df.numStudents.unique()[:10]
array([ 70000., 27393., 21160., 894., 15553., 62801., 65234.,
nan, 15424., 4533.])
df.numStudents.max(), df.numStudents.min()
(4197033329.0, 2.0)
# df.numStudents.astype('float')
# pd.to_numeric(df.numStudents)#, errors='ignore')
df.describe()
numStudents | |
---|---|
count | 1.380000e+04 |
mean | 3.934969e+05 |
std | 3.585477e+07 |
min | 2.000000e+00 |
25% | 1.065000e+03 |
50% | 1.068800e+04 |
75% | 3.397700e+04 |
max | 4.197033e+09 |
Clean up values for the endowment
First remove the numeric facet for numStudents and create a new numeric facet for endowment. Select only the non-numeric values, as was done for the number of students. Already we see issues like “US$1.3 billion” and “US $186 million”
df['endowment']
0 15
2 121
3 4700000
4 16586100
5 16586100
6 40200750
7 40200750
8 40200750
9 40200750
10 40200750
11 40200750
12 40200750
13 40200750
14 40200750
15 40200750
16 40200750
17 40200750
18 562000000
19 562000000
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 1.3E7
28 1.0E7
29 3.5E8
30 4.5E8
31 0.0
32 0.0
...
74996 $2,17 billion
74997 $2,17 billion
74998 $2,17 billion
74999 $2,17 billion
75000 $2,17 billion
75001 $2,17 billion
75002 $2,17 billion
75003 $2,17 billion
75004 US $401.2 million
75006 US $213.2 million
75007 US $381 million
75008 US $64.5 million
75009 US $64.5 million
75010 US $64.5 million
75011 US $716.8 million
75012 US $716.8 million
75013 US $716.8 million
75014 US $716.8 million
75015 US $716.8 million
75016 US $716.8 million
75017 US $716.8 million
75018 US $716.8 million
75019 US $716.8 million
75020 US $716.8 million
75021 US $716.8 million
75022 US $716.8 million
75023 US $716.8 million
75024 US $716.8 million
75025 US $716.8 million
75026 US $716.8 million
Name: endowment, dtype: object
np.sum(odds.values())
0.0
odds = {}
for i in df['endowment']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()[:3]
[('US$226 million', 1),
('US $6.44 billion', 16),
('1,5 billion \xe2\x82\xac', 1)]
np.float('1E6')
1000000.0
df.endowment = [str(i).replace('US $', '').replace('US$', '') for i in df.endowment]
df.endowment = [str(i).replace('USD$', '').replace('USD $', '') for i in df.endowment]
df.endowment = [str(i).replace('U.S. $', '').replace(',', '').strip() for i in df.endowment]
endowment = []
for i in df.endowment:
if i.__contains__('$'):
endowment.append(str(i).split('$')[1])
else:
endowment.append(i)
df.endowment = endowment
df.endowment = [str(i).replace(' million', 'E6').replace(' billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).replace('million', 'E6').replace('billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).replace(' Million', 'E6').replace(' Billion', 'E9').strip() for i in df.endowment]
df.endowment = [str(i).split(' ')[0] for i in df.endowment]
df.endowment = [str(i).replace('M', 'E6').strip() for i in df.endowment]
df.endowment = [str(i).replace(';', '').replace('+', '').strip() for i in df.endowment]
# df.endowment = [str(i).split('xbf')[1] for i in df.endowment]
# df.endowment = [str(i).split('xb')[1] for i in df.endowment]
# df.endowment = [str(i).split('xa')[1] for i in df.endowment]
After most of this has been cleaned up, select the non-numeric values, and delete them, just as was done for the numStudents.
df['endowment'] = df['endowment'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
df.describe()
endowment | numStudents | |
---|---|---|
count | 1.490400e+04 | 1.380000e+04 |
mean | 2.149103e+09 | 3.934969e+05 |
std | 1.927573e+10 | 3.585477e+07 |
min | 0.000000e+00 | 2.000000e+00 |
25% | 2.430000e+08 | 1.065000e+03 |
50% | 1.546000e+09 | 1.068800e+04 |
75% | 1.708000e+09 | 3.397700e+04 |
max | 1.545840e+12 | 4.197033e+09 |
- 巴西雷亚尔的符号 R$
- CANADA DOLLARS C$
- 澳元的货币符号 A$
Both “million” and “Million” are in the values, so it’s useful to convert all the values to lowercase instead of cleaning this up twice.
df.head()
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Paris Universitas | 15.0 | 5500 | 8000 | France | NaN | 2005 | NaN | 25000 | 70000.0 |
2 | Lumi%C3%A8re University Lyon 2 | 121.0 | NaN | 1355 | France | NaN | 1835 | 7046 | 14851 | 27393.0 |
3 | Confederation College | 4700000.0 | NaN | NaN | Canada | NaN | 1967 | not available | pre-university students; technical | 21160.0 |
4 | Rocky Mountain College | 16586100.0 | NaN | NaN | USA | NaN | 1878 | 66 | 878 | 894.0 |
5 | Rocky Mountain College | 16586100.0 | NaN | NaN | USA | NaN | 1878 | 66 | 878 | 894.0 |
Others
numFaculty, numDoctoral, numStaff, numPostgrad, numUndergrad, established
numFaculty
df.numFaculty = [str(i).replace('Total: ', '').replace(',', '') for i in df.numFaculty]
df.numFaculty = [str(i).replace('>', '').replace('~', '') for i in df.numFaculty]
df.numFaculty = [str(i).split(' ')[0] for i in df.numFaculty]
df['numFaculty'] = df['numFaculty'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
odds = {}
for i in df['numFaculty']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()
[]
numDoctoral
odds = {}
for i in df['numDoctoral']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()
[('available', 1),
('N\\A', 27),
('not available', 40),
('N.A', 16),
('~25', 4)]
df.numDoctoral = [str(i).replace('~', '') for i in df.numFaculty]
df['numDoctoral'] = df['numDoctoral'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
numStaff
odds = {}
for i in df['numStaff']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()
[('incl. 1,403 academics and 150 researchers', 2),
('Total: 873', 48),
('Approximately 7,170', 1),
('approximately 30', 1),
('167 full-time academic faculty, \n596 non-academic staff', 4),
('Around 120', 27),
('~500', 4),
('190 researchers & lecturers, 153 administrative & technical staff', 2),
('Approximately 500', 1),
('4,478 employees', 64),
('Around 5,000', 48),
('>21,000', 2),
('?', 1),
('~100', 4),
('full-time, part-time', 1),
('Total: 1,608', 2),
('Total: 1,600', 16),
('Full-time: 1,469', 2),
('156 full-time; 229 part-time', 1),
('Part-time: 139', 2),
('960 full-time, 460 part-time', 16),
('appx. 20', 8),
('~', 1)]
df.numStaff = [str(i).replace('Total: ', '').replace(',', '') for i in df.numStaff]
df.numStaff = [str(i).replace('>', '').replace('~', '') for i in df.numStaff]
df.numStaff = [str(i).replace('Around ', '').replace('appx. ', '') for i in df.numStaff]
df.numStaff = [str(i).replace(' employees', '').replace('Approximately ', '') for i in df.numStaff]
df.numStaff = [str(i).replace('Full-time: ', '').replace('Part-time: ', '') for i in df.numStaff]
df.numStaff = [str(i).replace('approximately ', '') for i in df.numStaff]
df.numStaff = [str(i).split(' ')[0] for i in df.numStaff]
df['numStaff'] = df['numStaff'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
numPostgrad
odds = {}
for i in df['numPostgrad']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()
[('268 full-time MBA', 24),
('approx. 300', 1),
('~650', 4),
('n/a', 1),
('Aprx. 2,000', 2),
('630 Dickinson School of Law', 640),
('120+', 8),
('795 Great Valley', 640),
('Does not offer postgraduate studies', 1),
('~3,914', 1),
('4,937 full-time, 1,446 part-time', 2),
('800 College of Medicine', 640),
('not available', 1),
('~3,200', 2),
('Masters of Business Administration in Community Economic Development', 20),
('Some postdoctoral students and visiting scholars', 1),
('postgraduate level degree available', 1),
('~160', 3),
('142 Ph.D. students', 2),
('1,682 Commonwealth Campuses', 640),
('available', 1),
('71 MLHR', 24),
('14,020 Total', 640),
('Approximately 1000', 1),
('Approx. 600', 1),
('----', 4),
('~500', 4),
('~5,500', 2),
('~60', 4),
('325 part-time MBA', 24),
('over 1,300', 1),
('N\\A', 9),
('none', 1),
('6,223 University Park', 640),
('ca. 3,230', 2),
('95 MAcc', 24),
('9,957 \xe2\x80\x93 Vancouver', 24),
('531 \xe2\x80\x93 Okanagan', 24),
('3,890 World Campus', 640),
('TBD', 2)]
df.numPostgrad = [str(i).replace('~', '').replace(',', '') for i in df.numPostgrad]
df.numPostgrad = [str(i).replace('approx. ', '').replace('Approx. ', '') for i in df.numPostgrad]
df.numPostgrad = [str(i).replace('Approximately ', '').replace('Aprx. ', '') for i in df.numPostgrad]
df.numPostgrad = [str(i).replace('+', '').replace('over', '') for i in df.numPostgrad]
df.numPostgrad = [str(i).split(' ')[0] for i in df.numPostgrad]
df['numPostgrad'] = df['numPostgrad'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
numUndergrad
odds = {}
for i in df['numUndergrad']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()
[('4,747 full-time', 96),
('over 2,000', 1),
('Approximately 10,000', 1),
('2,000+', 8),
('approximately 1,500', 1),
('n/a', 1),
('Aprx. 2,000', 2),
('School of Liberal Arts; School of Science & Technology; School of Graduate & Professional Studies; Shannon School of Business',
20),
('approx. 2,000', 1),
('36,518 \xe2\x80\x93 Vancouver', 24),
('~25,000', 2),
('65 per year', 1),
('77,179 Total', 896),
('approx. 1,200', 8),
('Approx. 12,000', 2),
('38,594 University Park', 896),
('32,295 Commonwealth Campuses', 896),
('Approximately 2,300', 1),
('7,004 \xe2\x80\x93 Okanagan', 24),
('~400', 1),
('pre-university students; technical', 1),
('Around 10,000', 2),
('available', 3),
('21,726 -', 24),
('None', 1),
('900+', 1),
('475 Resident Undergraduates', 1),
('ca. 3,046', 2),
('Approx. 7,100', 1),
('6,290 PA College of Tech', 896),
('4,634 World Campus', 896),
('28,477 full-time, 2,102 part-time', 2),
('none', 1),
('approx. 2,150', 4),
('Approximately 730', 1),
('~13,000', 2),
('~2,000', 4),
('diploma, degree available', 1),
('Approx. 13,000', 1),
('2,000 traditional', 1),
('~1,560', 1),
('???', 1),
('~1,550', 4)]
df.numUndergrad = [str(i).replace('~', '').replace(',', '') for i in df.numUndergrad]
df.numUndergrad = [str(i).replace('approx. ', '').replace('Approx. ', '') for i in df.numUndergrad]
df.numUndergrad = [str(i).replace('Approximately ', '').replace('Aprx. ', '') for i in df.numUndergrad]
df.numUndergrad = [str(i).replace('approximately ', '').replace('Around ', '') for i in df.numUndergrad]
df.numUndergrad = [str(i).replace('+', '').replace('over', '') for i in df.numUndergrad]
df.numUndergrad = [str(i).split(' ')[0] for i in df.numUndergrad]
df['numUndergrad'] = df['numUndergrad'].apply(lambda x: np.float(x)
if isDigit(x)
else np.nan)
established
df.established
0 2005
2 1835
3 1967
4 1878
5 1878
6 1901
7 1901
8 1947
9 1947
10 1963
11 1963
12 1963 - university status
13 1963 - university status
14 1947 - four-year college
15 1947 - four-year college
16 1901 -
17 1901 -
18 1924
19 1924
22 1970
23 1918-05-01
24 1925
25 Established 1985
26 Chartered 1984
27 1994
28 1947
29 1948
30 1878
31 2004-09-30
32 2004-09-30
...
74996 1848
74997 1848
74998 1848
74999 1848
75000 1848
75001 1848
75002 1848
75003 1848
75004 1881-08-28
75006 1855-10-15
75007 1911
75008 1964
75009 1964
75010 1964
75011 1851
75012 1851
75013 1851
75014 1851
75015 1851
75016 1851
75017 1851
75018 1851
75019 1851
75020 1851
75021 1851
75022 1851
75023 1851
75024 1851
75025 1851
75026 1851
Name: established, dtype: object
odds = {}
for i in df['established']:
if not isDigit(i):
try:
odds[i] += 1
except:
odds[i] = 1
odds.items()[:5]
[]
import re
def getYear(s):
try:
match = re.match(r'.*([1-3][0-9]{3})', s)
return np.int(match.group(1))
except:
return np.nan
df.established = [getYear(i) for i in df.established]
df.describe()
endowment | numFaculty | numDoctoral | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|
count | 1.490400e+04 | 1.348700e+04 | 1.348700e+04 | 3.161000e+03 | 15186.000000 | 1.513800e+04 | 1.550000e+04 | 1.380000e+04 |
mean | 2.149103e+09 | 8.695432e+03 | 8.695432e+03 | 1.057127e+04 | 1865.196036 | 1.104538e+04 | 2.762582e+05 | 3.934969e+05 |
std | 1.927573e+10 | 1.286384e+05 | 1.286384e+05 | 2.483439e+05 | 65.865596 | 3.622826e+05 | 2.581618e+07 | 3.585477e+07 |
min | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1066.000000 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 |
25% | 2.430000e+08 | 1.953000e+03 | 1.953000e+03 | 1.592000e+03 | 1855.000000 | 8.000000e+02 | 6.290000e+03 | 1.065000e+03 |
50% | 1.546000e+09 | 8.864000e+03 | 8.864000e+03 | 2.799000e+03 | 1855.000000 | 3.067000e+03 | 1.782100e+04 | 1.068800e+04 |
75% | 1.708000e+09 | 8.864000e+03 | 8.864000e+03 | 5.000000e+03 | 1881.000000 | 6.223000e+03 | 3.667500e+04 | 3.397700e+04 |
max | 1.545840e+12 | 1.407201e+07 | 1.407201e+07 | 1.280201e+07 | 2012.000000 | 2.998201e+07 | 3.198523e+09 | 4.197033e+09 |
University
df['university'].unique(), df['university'].unique().size
(array(['Paris Universitas', 'Lumi%C3%A8re University Lyon 2',
'Confederation College', ..., 'University of San Francisco',
'Loyola Marymount University', 'Nova Southeastern University'], dtype=object),
1085)
# university_df = df.groupby('university').size()
# for k in university_df.index:
# print(k, university_df[k])
df.university = [str(i).replace('%2C', '').replace(',', '') for i in df.university]
df.university = [str(i).replace('%C3%89', 'e').replace('%C3%A9', 'e') for i in df.university]
df.university = [str(i).replace('%E2%80%93', ' ').replace('%22', '') for i in df.university]
df.university = [str(i).replace('%C3%A3', 'a').replace('%C3%A7', 'c') for i in df.university]
df.university = [str(i).replace('%C3%AA', 'e').replace('%C3%BA', 'u') for i in df.university]
df.university = [str(i).replace('%C3%A1', 'a').replace('%C3%B3', 'o') for i in df.university]
df.university = [str(i).replace('%CA%BB', "'").replace('%C4%81', 'a') for i in df.university]
df.head()
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Paris Universitas | 15.0 | 5500.0 | 5500.0 | France | NaN | 2005.0 | NaN | 25000.0 | 70000.0 |
2 | Lumi%C3%A8re University Lyon 2 | 121.0 | NaN | NaN | France | NaN | 1835.0 | 7046.0 | 14851.0 | 27393.0 |
3 | Confederation College | 4700000.0 | NaN | NaN | Canada | NaN | 1967.0 | NaN | NaN | 21160.0 |
4 | Rocky Mountain College | 16586100.0 | NaN | NaN | USA | NaN | 1878.0 | 66.0 | 878.0 | 894.0 |
5 | Rocky Mountain College | 16586100.0 | NaN | NaN | USA | NaN | 1878.0 | 66.0 | 878.0 | 894.0 |
len(df)
15534
Deduplicate entries 2.0
df = df.drop_duplicates()
len(df)
4919
df.head()
university | endowment | numFaculty | numDoctoral | country | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Paris Universitas | 15.0 | 5500.0 | 5500.0 | France | NaN | 2005.0 | NaN | 25000.0 | 70000.0 |
2 | Lumi%C3%A8re University Lyon 2 | 121.0 | NaN | NaN | France | NaN | 1835.0 | 7046.0 | 14851.0 | 27393.0 |
3 | Confederation College | 4700000.0 | NaN | NaN | Canada | NaN | 1967.0 | NaN | NaN | 21160.0 |
4 | Rocky Mountain College | 16586100.0 | NaN | NaN | USA | NaN | 1878.0 | 66.0 | 878.0 | 894.0 |
6 | Idaho State University | 40200750.0 | 838.0 | 838.0 | USA | 1269.0 | 1901.0 | 2661.0 | 12892.0 | 15553.0 |
df.describe()
endowment | numFaculty | numDoctoral | numStaff | established | numPostgrad | numUndergrad | numStudents | |
---|---|---|---|---|---|---|---|---|
count | 4.585000e+03 | 3.650000e+03 | 3.650000e+03 | 2.159000e+03 | 4677.000000 | 4.568000e+03 | 4.887000e+03 | 3.734000e+03 |
mean | 3.128695e+09 | 1.101300e+04 | 1.101300e+04 | 1.302496e+04 | 1879.911268 | 2.654159e+04 | 8.144919e+05 | 1.390889e+06 |
std | 3.311783e+10 | 2.472490e+05 | 2.472490e+05 | 3.004688e+05 | 95.221953 | 6.592566e+05 | 4.597518e+07 | 6.892540e+07 |
min | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1066.000000 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 |
25% | 2.900000e+07 | 5.315000e+02 | 5.315000e+02 | 1.178000e+03 | 1855.000000 | 1.305000e+03 | 4.634000e+03 | 7.542000e+03 |
50% | 1.892540e+08 | 1.099000e+03 | 1.099000e+03 | 2.211000e+03 | 1887.000000 | 3.364000e+03 | 1.348000e+04 | 1.933300e+04 |
75% | 1.295000e+09 | 3.694000e+03 | 3.694000e+03 | 4.478000e+03 | 1924.000000 | 6.614250e+03 | 2.332800e+04 | 2.570000e+04 |
max | 1.545840e+12 | 1.407201e+07 | 1.407201e+07 | 1.280201e+07 | 2012.000000 | 2.998201e+07 | 3.198523e+09 | 4.197033e+09 |
plt.matshow(df.corr())
plt.show()
def correlation_matrix(df):
from matplotlib import pyplot as plt
from matplotlib import cm as cm
fig = plt.figure(figsize=(7, 7),facecolor='white')
ax1 = fig.add_subplot(111)
cmap = cm.get_cmap('jet', 30)
cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
ax1.grid(True)
plt.title('Correlation')
labels=['','endowment','numFaculty','numDoctoral','numStaff',\
'established','numPostgrad','numUndergrad','numStudents']
ax1.set_xticklabels(labels,fontsize=16, rotation = 90)
ax1.set_yticklabels(labels,fontsize=16)
# Add colorbar, make sure to specify tick locations to match desired ticklabels
fig.colorbar(cax)
plt.show()
correlation_matrix(df)
plt.plot(df.numStaff, df.numPostgrad, 'o')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('$Staff$', fontsize = 20)
plt.ylabel('$Post\;Graduates$', fontsize = 20)
plt.show()
Write data
df.to_csv(‘universityDataClean.csv’, index = False)
Jupyter Notebook
本文所使用到的所有代码见 nbviewer。
Leave a Comment