The goal of this project was to acquire the possible addresses of the Ph.D. students who graduated from the School of Library and Information Science (SLIS), Indiana University, Bloomington.
The internet is one massive repository of information. It hosts people's work in the form of publications, projects, homepages, and more. Several search engines are currently available to help us search for information. In our project, we used the Google search engine to help accomplish our goal.
Methodology:
The algorithm to meet the goal was written in Perl. Its pseudo-code is outlined below:
Input
a) List of Ph.D. graduates' names (First Name # Last Name)
b) List of US state names (Abbrev. # Name)
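Both input lists use "#" as the field separator, so a single routine can read either one. The following is a minimal sketch of that step. It is written in Python for illustration rather than the project's Perl, and the function name `parse_hash_delimited` and the sample records are hypothetical, not taken from the project files.

```python
def parse_hash_delimited(lines):
    """Split 'left # right' lines into (left, right) tuples.

    Blank lines and lines without a '#' separator are skipped.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        left, sep, right = line.partition("#")
        if not sep:
            continue  # malformed line: no '#' separator found
        pairs.append((left.strip(), right.strip()))
    return pairs


# Hypothetical sample data in the two input formats.
graduates = parse_hash_delimited(["Jane # Doe", "John # Smith"])
states = parse_hash_delimited(["IN # Indiana", "OH # Ohio"])
```

In practice the lines would come from the two input files, for example by passing an open file handle to the function.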
Parsing approach
+ The Google search engine accepts a search query only if it appears to come from the Google search web page. To satisfy this requirement, the code sets the referring site to Google, so that the query appears to originate from the Google web page.
+ The search query terms are the person's first name and last name.
+ To acquire the information, searches were performed on the following:
a) Google phone directory:
+ The initial search was performed in the Google phone directory, which lists people's addresses along with their phone numbers.
+ Each person's name was queried against each US state, and the search hit pages were filtered for the required data: the person's name, address, and contact number.
b) Homepage search:
+ The person's full name was used to query the Google search engine.
+ The search result hit was taken as the homepage.
+ Two conditions were checked: i) with frames, ii) without frames.
+ If frames exist, the directory URL was extracted and the existence of the web page was confirmed by trying multiple file-name combinations (e.g., resume.html, resume.htm).
+ If frames do not exist, the existence of the page was checked directly.
+ Finally, the address information was extracted from the homepage.
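The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's Perl code, and it only builds requests and URLs without sending anything. The search endpoint, the referring URL, the use of the state abbreviation in the query, and the helper names (`build_search_request`, `phone_directory_queries`, `candidate_pages`) are all assumptions made for the example.

```python
from itertools import product
from urllib.parse import urlencode, urljoin
from urllib.request import Request

SEARCH_URL = "https://www.google.com/search"  # assumed search endpoint
REFERER = "https://www.google.com/"           # assumed referring page


def build_search_request(query):
    """Build (but do not send) a request that names Google as the referrer."""
    url = SEARCH_URL + "?" + urlencode({"q": query})
    return Request(url, headers={"Referer": REFERER})


def phone_directory_queries(graduates, states):
    """Yield one query string per (person, state) combination.

    Assumes the state abbreviation is appended to the person's name.
    """
    for (first, last), (abbrev, _name) in product(graduates, states):
        yield f"{first} {last} {abbrev}"


def candidate_pages(dir_url, stems=("resume",), extensions=(".html", ".htm")):
    """List the page URLs to probe under a homepage's directory URL."""
    if not dir_url.endswith("/"):
        dir_url += "/"
    return [urljoin(dir_url, stem + ext) for stem in stems for ext in extensions]


# Hypothetical example values.
request = build_search_request("Jane Doe IN")
queries = list(phone_directory_queries([("Jane", "Doe")],
                                       [("IN", "Indiana"), ("OH", "Ohio")]))
pages = candidate_pages("http://example.org/~jdoe")
```

Each candidate URL would then be requested in turn, and the first one that exists would be parsed for address information. That fetching and filtering step is omitted here because the report does not describe the page formats involved.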
Output
a) Search results are stored in individual data files, one per person, named using a combination of the person's first and last name. The data obtained by each approach is clearly indicated in the file (sample person's result file).
b) A web page, generated by the script, that shows the results for all the names used in the search.
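A per-person result file of this kind could be produced along the following lines. This is a hedged Python sketch rather than the project's Perl code. The file-name pattern, the section labels, and the sample hit are assumptions, since the report does not specify the exact layout.

```python
import io


def result_filename(first, last):
    """Build a file name from first and last name (the pattern is assumed)."""
    return f"{first}_{last}.txt"


def write_results(out, phone_hits, homepage_hits):
    """Write both result sets to a file-like object, labelled by approach."""
    sections = (("Google phone directory", phone_hits),
                ("Homepage search", homepage_hits))
    for label, hits in sections:
        out.write(f"== {label} ==\n")
        for hit in hits or ["(no results)"]:
            out.write(hit + "\n")


# Hypothetical example: write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_results(buffer, ["Jane Doe, 1 Main St, Bloomington, IN"], [])
report = buffer.getvalue()
```

To write to disk instead, the buffer can be replaced with `open(result_filename("Jane", "Doe"), "w")`.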