1. Paepcke, A., García-Molina, H., & Wesley, R. (2005). Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative.
D-Lib Magazine, 11. Retrieved from
http://www.dlib.org/dlib/july05/paepcke/07paepcke.html
NSF --> Digital Libraries Initiative (1994)
- collaboration librarians x CSists
- research x daily affairs, aka theory x practice
- shared values
- need to share w/ wider community
- linkage of reliable info not just for "info pros" but also CS
- Google one of many results
- how to access, share funding?
- misconceptions from both parties
- "hubs" as new framework for collections online
- connections b/w librarians <> scholarly authors
2. Lynch, C. A. (2003). Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age.
Association of Research Libraries, 226. Retrieved from
http://www.arl.org/storage/documents/publications/arl-br-226.pdf
Institutional repositories (2002)
- definition
- provides services to uni community for mgmt and dissemination of digital mats.
- work by both fac & students
- research & teaching
- stewardship of such mats.
- supported by diff. techs.
- ++ accountability for unis
- ++ active role in scholarly publishing
- forging more strategic, mutually beneficial alliances
New patterns in access/dissemination
- decrease in online storage costs
- standards for metadata --> interop.
MIT DSpace x HP (2003)
- model for other reps both in the U.S. and internationally
- open-source software
- esp. important for institutions w/ significantly lower endowments/resources
Strategic importance
- near-term & long-term preservation of scholarly works, esp. by faculty
- supplementary materials
- preprints? "first access"
- also affiliation w/ institution
- what is worth collecting?
- encourage faculty to use institution resources
- complement to disciplinary repositories
Potential dangers
- institutional control over intell. property
- centralization (inst.) v. decentralization (discipline/dept)
- risk of inappropriate policy constraints?
- too fashionable?
- hasty implementation w/o judging merits or sustained commitment?
Networked info standards and infrastructure
- preservable formats
- identifiers
- persistent and consistent reference to mats.
- rights doc. and mgmt
- again, metadata
- but also controlled vocab (?)
3. Hawking, D. (2006). How Things Work: Web Search Engines: Parts 1 and 2.
IEEE Computer. Retrieved from
http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf
Data processing
- search tools and interfaces share many of the same data structures and algorithms
- search engines can't/shouldn't index all pgs
- b/c no. of pgs is effectively infinite (dynamically generated pgs)
- more useful to
- reject "low-value content"
- ignore huge vols. of accessible data
Problems and techniques
- multiple locations for data centers
- provides redundancy and fault tolerance
- PC type depends on factors like price, speed, memory, physical size, etc.
- clusters can target specialized functions
- ex. crawling, indexing, replication
Crawling algorithms
- queue of unvisited URLs
- started by 1 or more "seed" URLs, then HTTP request
- huge data structure required
- real crawlers
- different speeds
- risk of server overload
- only 1 request per server at a time
- "politeness" delay b/w requests
- excluded content
- check site's robots.txt file
- to see whether all or part of site is off-limits to crawlers
- duplicate content
- unrecognized duplicates could contain links to still more duplicates
- early detection necessary
- continuous crawling
- full crawls at fixed intervals mean slow response to important changes on web
- instead use priority queue to decide which URLs to (re)visit first
- spam rejection
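The crawling notes above (URL queue, seeds, politeness delay, robots.txt) can be sketched roughly as follows. This is a toy under stated assumptions, not Hawking's implementation: `FAKE_WEB` and `FAKE_ROBOTS` are made-up stand-ins for real HTTP fetches, and the clock is simulated so nothing actually sleeps.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> outgoing links (replaces HTTP requests).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/", "http://a.example/private/y"],
    "http://a.example/private/y": [],
    "http://b.example/": ["http://a.example/x"],
}
# Hypothetical robots.txt lines per host.
FAKE_ROBOTS = {
    "a.example": ["User-agent: *", "Disallow: /private/"],
    "b.example": [],
}

def crawl(seeds, politeness_delay=1.0):
    """Breadth-first crawl; returns (simulated_time, url) in fetch order."""
    queue = deque(seeds)   # queue of unvisited URLs
    seen = set(seeds)      # every URL ever queued (cheap duplicate check)
    robots = {}            # host -> parsed robots.txt
    next_ok = {}           # host -> earliest simulated time for next request
    clock = 0.0
    order = []
    while queue:
        url = queue.popleft()
        host = urlparse(url).netloc
        if host not in robots:
            parser = RobotFileParser()
            parser.parse(FAKE_ROBOTS.get(host, []))
            robots[host] = parser
        if not robots[host].can_fetch("*", url):
            continue       # excluded content
        # Politeness: wait until this host may be contacted again.
        clock = max(clock, next_ok.get(host, 0.0))
        next_ok[host] = clock + politeness_delay
        order.append((clock, url))
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

fetched = [url for _, url in crawl(["http://a.example/"])]
```

A real crawler would replace the plain queue with a priority queue and detect duplicate content, not just duplicate URLs.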
Indexing algorithms
- use inverted files to rapidly find docs containing a given term
- 2 phases
- scan text of each doc
- inversion (?)
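On the "inversion (?)" step: a minimal sketch of the two phases, assuming a tiny in-memory collection. Phase 1 scans each doc and emits one tuple per word occurrence; phase 2 sorts those tuples by term and groups them into postings lists. The function names and the whitespace-style tokenizer are illustrative choices, not from the paper.

```python
import re
from itertools import groupby

def scan(docs):
    """Phase 1: emit (term, doc_id, position) for every word occurrence."""
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            yield term, doc_id, pos

def invert(triples):
    """Phase 2: sort by term, then group into postings lists."""
    index = {}
    for term, group in groupby(sorted(triples), key=lambda t: t[0]):
        postings = {}
        for _, doc_id, pos in group:
            postings.setdefault(doc_id, []).append(pos)
        # doc_id -> positions; len(positions) is the term frequency
        index[term] = postings
    return index

docs = ["the cat sat", "the dog sat on the cat"]
index = invert(scan(docs))
# index["the"] maps doc 0 to [0] and doc 1 to [0, 4]
```

At web scale the sort would not fit in memory, so this only shows the shape of the idea.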
Real indexers
- store addt'l info in postings
- ex. term frequency, positions
- scaling up
- term lookup
- compression for key structures
- precomputing for common phrases
- indexing anchor text w/ target & source (?)
- popularity score of pages
- derived from frequency of incoming links
- ex. PageRank
- query-independent score
- internal ranking
- ++ score, ++ retrieval probability
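To make the link-based popularity score concrete, here is a simplified textbook-style PageRank by power iteration. It is a sketch only: the damping factor 0.85, the iteration count, and the toy link graph are assumptions, and production ranking is far more involved. Note that a page's score depends on the scores of the pages linking to it, not just on the raw count of incoming links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page (no out-links): spread its rank evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: a, b, and d all link to c; c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

Because the score is query-independent, it can be computed once at indexing time and stored alongside each doc.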
Query-processing algorithms
- most common type of query
- return docs containing all query words
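A minimal sketch of the "all query words" case, assuming each term maps to a sorted list of doc IDs. The postings data below is made up; the merge-style intersection is a standard technique rather than something specified in these notes.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of doc IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Return sorted doc IDs that contain every term in `terms`."""
    lists = [index.get(t, []) for t in terms]
    if not lists or any(not postings for postings in lists):
        return []
    lists.sort(key=len)  # start with the rarest term to keep results small
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Hypothetical postings lists.
toy_index = {
    "web": [1, 2, 4, 7],
    "search": [2, 3, 4, 9],
    "engine": [4, 9],
}
```

For example, `and_query(toy_index, ["web", "search"])` keeps only the docs that appear in both lists.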
Real processors
- simple-query processor usu. = poor results
- increase in quality
- scans to end and sorts lists by relevance
- but too slow and computationally expensive
Increasing speed
- skipping
- early termination
- can stop processing after short scan
- better assignment of doc numbers (??)
- caching
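One possible reading of how these speed-ups fit together, sketched under assumptions: if doc numbers are assigned in order of decreasing query-independent score, the best docs sit at the front of every postings list, so a scan can stop after the first k matches. Here binary search stands in for skip pointers, and `functools.lru_cache` stands in for a result cache. `INDEX` and `top_k` are hypothetical names with made-up data.

```python
from bisect import bisect_left
from functools import lru_cache

# Hypothetical postings. Doc IDs are assumed to be assigned in order of
# decreasing query-independent score, so doc 0 is the "best" page.
INDEX = {
    "digital": (0, 2, 3, 5, 8, 9),
    "library": (0, 1, 3, 8, 9),
}

def contains(sorted_ids, doc_id):
    """Skipping stand-in: binary search instead of a linear scan."""
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

@lru_cache(maxsize=1024)  # caching: repeated queries skip the scan entirely
def top_k(terms, k=2):
    """AND query over `terms` (a tuple) that stops after k matches."""
    postings = [INDEX.get(t, ()) for t in terms]
    if not postings:
        return ()
    hits = []
    for doc_id in postings[0]:
        if all(contains(other, doc_id) for other in postings[1:]):
            hits.append(doc_id)
            if len(hits) == k:  # early termination
                break
    return tuple(hits)
```

Calling `top_k(("digital", "library"))` twice runs the scan once; the second call is answered from the cache.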
4. Shreeves, S. L., Habing, T. G., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI Protocol for Metadata Harvesting.
Library Trends, 53. Retrieved from
http://hdl.handle.net/2142/1754
Open Archives Initiative Protocol for Metadata Harvesting (2001)
- scalable solution for community metadata needs
- implementation nonspecific
- facilitate use in wide variety of institutions and domains
- min. use: DC schema
- access to "invisible web" + aggregate sources from diff collections
- 2 "entities" who use protocol
- data providers, aka repositories
- service providers, aka harvesters
- can build value-added services
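A rough sketch of what a harvester (service provider) does with a data provider, assuming the standard `ListRecords` verb and the `oai_dc` (Dublin Core) metadata prefix. The base URL and the sample response are invented for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository's base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

# Hypothetical, heavily trimmed response from a data provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample sheet music</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs

url = list_records_url("http://example.org/oai")  # hypothetical base URL
records = parse_records(SAMPLE_RESPONSE)
```

A real harvester would also follow resumption tokens for large result sets and handle protocol error responses.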
Current trends and developments
- user group-specific service providers
- diff comms develop diff standards in addition to protocol
- Open Language Archives Community
- Sheet Music Consortium
- particular problem b/c of sheet music, cover art, lyrics, etc.
- allows users to annotate metadata
- National Science Dig Lib
- OAI protocol primary means
- build + aggregate collections and services/infrastructure to support activities
Shortcomings of existing registries
- usu. very sparse recs about indiv. reps
- no search mechanism
- ltd browsing
- few registries have complete list of all available reps
Developing experimental OAI registry (UIUC)
- completeness
- inventory of existing registries
- following and exploring links
- search Google for OAI reps
- discoverability
- allow for diff views w/o any manual cataloging of OAI reps
- automation of data harvesting and indexing
- machine processing
- turn registry into OAI rep
Future work
- for better search and discovery, enhance collection-level desc
- increase in automated maintenance of registry
- increase in automated discovery of other registries
- delegate creation and maintenance of virtual collections, incl. metadata
- improve view of search results (contextualization)
ERRoL resolution (Extensible Repository Resource Locators)
- "cool URLs" (Berners-Lee) to content and services linked to info in OAI rep
- OAI-id for item
Challenges
- data provider implementations
- many potentially useful features underutilized
- metadata
- ways of using encoding standards differ
- leads to diff relevance for users
- ++ formats, ++ complex metadata
- lack of communication b/w service and data providers
Future directions
- development of best practices
- Static Repository Gateway (Los Alamos Natl Lab)
- low technical entry barrier
- mod_oai project
- makes content on Apache open-source web servers accessible via OAI protocol
- OAI rights
- means of expressing rights in structured lang w/in protocol
- controlled vocabs
- gateway to ERRoL service