Friday, November 14, 2014

Week 11: Digital library and web search

1. Paepcke, A., García-Molina, H., & Wesley, R. (2005). Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative. D-Lib Magazine, 11. Retrieved from http://www.dlib.org/dlib/july05/paepcke/07paepcke.html


NSF --> Digital Libraries Initiative (1994)
  • collaboration librarians x CSists
    • research x daily affairs, aka theory x practice
    • shared values
      • need to share w/ wider community
      • linkage of reliable info not just for "info pros" but also CS
  • Google was one of many results of the initiative
  • how to access, share funding?
    • misconceptions from both parties
  • "hubs" as new framework for collections online
  • connections b/w librarians <> scholarly authors


2. Lynch, C. A. (2003). Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. ARL Bimonthly Report, 226. Retrieved from http://www.arl.org/storage/documents/publications/arl-br-226.pdf

Institutional repositories (2002)
  • definition
    • provides services to uni community for mgmt and dissemination of digital mats.
      • work by both fac & students
      • research & teaching
    • stewardship of such mats.
      • also data
    • supported by diff. techs.
  • ++ accountability for unis
    • ++ active role in scholarly publishing
    • forging more strategic, mutually beneficial alliances

New patterns in access/dissemination
  • decrease in online storage costs
  • standards for metadata --> interop.

MIT DSpace x HP (2003)
  • model for other reps both in the U.S. and internationally
  • open-source software
    • esp. important for institutions w/ significantly lower endowments/resources

Strategic importance
  • near-term & long-term preservation of scholarly works, esp. by faculty
  • supplementary materials
    • preprints? "first access"
  • also affiliation w/ institution
  • what is worth collecting?
  • encourage faculty to use institution resources
    • complement to disciplinary repositories

Potential dangers
  • institutional control over intell. property
  • centralization (inst.) v. decentralization (discipline/dept)
    • risk of inappropriate policy constraints?
  • too fashionable?
    • hasty implementation w/o judging merits or sustained commitment?

Networked info standards and infrastructure
  • preservable formats
  • identifiers
    • persistent and consistent reference to mats.
  • rights doc. and mgmt
    • again, metadata
    • but also controlled vocab (?) 

3. Hawking, D. (2006). How Things Work: Web Search Engines: Parts 1 and 2. IEEE Computer. Retrieved from http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf

Data processing
  • tools and interfaces have many of same data structures and algorithms in common
  • search engines can't/shouldn't index all pgs
    • b/c no. of pgs is effectively infinite
  • more useful to
    • reject "low-value content"
    • ignore huge vols. of accessible data

Problems and techniques
  • multiple locations for data centers
    • redundancy helps tolerate faults
    • PC type depends on factors like price, speed, memory, physical size, etc.
    • clusters can target specialized functions
      • ex. crawling, indexing, replication

Crawling algorithms
  • queue of unvisited URLs
    • started by 1 or more "seed" URLs, then HTTP request
    • huge data structure required
  • real crawlers
    • different speeds
    • risk of server overload 
      • only 1 req/server at a time
      • "politeness" delay b/w requests
  • excluded content
    • check site's robots.txt file 
    • to see whether parts or all of site should be crawled
  • duplicate content
    • unrecognized duplicates could contain links to other duplicates
    • early detection necessary
  • continuous crawling
    • full crawls at fixed intervals = slow response to important changes
    • instead use priority queue of URLs
  • spam rejection
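
A minimal sketch of the crawling loop described above, assuming a tiny in-memory "web" instead of real HTTP requests. FAKE_WEB, ROBOTS, the "NotesBot" user agent and the simulated clock are all hypothetical stand-ins for illustration, not anything from Hawking's article. It shows the queue seeded with a start URL, the robots.txt check, a per-host politeness delay, and duplicate detection by content hash.

```python
import hashlib
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("home", ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("page x", ["http://a.example/private/y",
                                      "http://a.example/x2"]),
    "http://a.example/x2": ("page x", []),  # same content as /x
    "http://a.example/private/y": ("secret", []),
    "http://b.example/": ("other site", ["http://a.example/"]),
}
# Hypothetical robots.txt lines per host (hosts not listed allow everything).
ROBOTS = {"a.example": ["User-agent: *", "Disallow: /private/"]}


def crawl(seeds, delay=1.0):
    """Return a list of (simulated fetch time, URL) for pages kept."""
    frontier = deque(seeds)      # queue of unvisited URLs
    seen_urls = set(seeds)       # never enqueue the same URL twice
    seen_hashes = set()          # content hashes, for duplicate detection
    robots = {}                  # host -> parsed robots.txt rules
    next_allowed = {}            # host -> earliest time of the next request
    clock = 0.0                  # simulated time, no real sleeping
    kept = []
    while frontier:
        url = frontier.popleft()
        host = urlparse(url).netloc
        if host not in robots:
            parser = RobotFileParser()
            parser.parse(ROBOTS.get(host, []))
            robots[host] = parser
        if not robots[host].can_fetch("NotesBot", url):
            continue             # excluded by the site's robots.txt
        # Politeness: wait until this host may be contacted again.
        clock = max(clock, next_allowed.get(host, 0.0))
        next_allowed[host] = clock + delay
        text, links = FAKE_WEB.get(url, ("", []))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue             # duplicate content: drop page and its links
        seen_hashes.add(digest)
        kept.append((clock, url))
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return kept
```

A plain FIFO queue is used here for brevity. A continuous crawler would replace the deque with a priority queue so that important or fast-changing pages are revisited first.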

Indexing algorithms
  • use inverted files to rapidly find docs containing a term
  • 2 phases
    • scan text of each doc
    • inversion: sort postings by term, then by doc no.
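
A rough sketch of the two phases, reading "inversion" as sorting the scanned postings by term and then by doc number. The tokenizer and the sample docs are assumptions for illustration only. A real indexer would write postings to temporary files rather than hold them in memory.

```python
import re


def build_inverted_file(docs):
    """Map each term to the sorted list of doc numbers that contain it."""
    # Phase 1 (scanning): emit one (term, doc number) posting per occurrence.
    postings = []
    for doc_no, text in enumerate(docs):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings.append((term, doc_no))
    # Phase 2 (inversion): sort by term, then doc number, and group by term.
    postings.sort()
    index = {}
    for term, doc_no in postings:
        doc_list = index.setdefault(term, [])
        if not doc_list or doc_list[-1] != doc_no:  # skip repeats in one doc
            doc_list.append(doc_no)
    return index


docs = ["digital library search", "web search engines", "digital web"]
index = build_inverted_file(docs)
print(index["search"])
```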

Real indexers
  • store addt'l info in postings
    • ex. term frequency, positions
  • scaling up
    • doc partitioning
  • term lookup
  • compression for key structures
  • precomputing for common phrases
  • indexing anchor text w/ target & source (?)
    • useful for descriptions
  • popularity score of pages
    • derived from frequency of incoming links
    • ex. PageRank
  • query-independent score
    • internal ranking
    • ++ score, ++ retrieval probability
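
To make the link-based popularity score concrete, here is a simplified PageRank-style power iteration. It is a textbook sketch, not how any production engine computes its scores; the damping factor of 0.85, the iteration count and the three-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Page c has the most incoming links and ends up with the highest score, while b, which nothing links to, keeps only the baseline share.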

Query-processing algorithms
  • most common type of query
    • avg length 2.3 words
  • return docs containing all query words
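
A minimal sketch of "return docs containing all query words", assuming an inverted index that maps each term to a sorted list of doc numbers. TOY_INDEX and the function names are hypothetical. The idea is to intersect the postings lists, starting with the shortest.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of doc numbers."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def and_query(index, query):
    """Return doc numbers that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]            # shortest list first keeps results small
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical toy index: term -> sorted doc numbers.
TOY_INDEX = {
    "digital": [0, 2, 5],
    "library": [0, 3, 5, 7],
    "search": [1, 5],
}
```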

Real processors
  • simple-query processor usu. = poor results
  • increase in quality
    • scans to end and sorts lists by relevance
    • but too computationally time-consuming, expensive

Increasing speed
  • skipping
  • early termination
    • can stop processing after short scan
  • better assignment of doc numbers (ex. in order of decreasing query-independent score)
  • caching
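
A sketch of how two of these speed-ups might fit together, under the assumption that doc numbers are assigned in order of decreasing query-independent score (doc 0 is the "best" page). With that numbering, the first k matches found while scanning the postings lists are also the k best by that score, so the scan can stop early. Repeated queries are served from a cache. INDEX, top_k and cached_query are hypothetical names; skipping is not shown, since the lists here are advanced one entry at a time.

```python
from functools import lru_cache

# Hypothetical index: term -> doc numbers, sorted ascending. Lower doc
# number is assumed to mean higher query-independent score.
INDEX = {
    "web": (0, 1, 3, 4, 6, 8),
    "search": (1, 2, 3, 6, 7, 8),
}


def top_k(terms, k):
    """Return the first k docs containing all terms, then stop scanning."""
    lists = [INDEX.get(t, ()) for t in terms]
    if not lists or any(not postings for postings in lists):
        return ()
    pos = [0] * len(lists)
    hits = []
    while len(hits) < k:         # early termination once k hits are found
        if any(p >= len(postings) for p, postings in zip(pos, lists)):
            break                # one list is exhausted: no more matches
        heads = [postings[p] for p, postings in zip(pos, lists)]
        target = max(heads)
        if all(h == target for h in heads):
            hits.append(target)
            pos = [p + 1 for p in pos]
        else:
            # Advance only the lists that are behind the largest head.
            pos = [p + 1 if postings[p] < target else p
                   for p, postings in zip(pos, lists)]
    return tuple(hits)


@lru_cache(maxsize=1024)
def cached_query(query, k=2):
    """Cache results so a repeated query skips the postings scan."""
    return top_k(tuple(query.lower().split()), k)
```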


4. Shreeves, S. L., Habing, T. G., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI Protocol for Metadata Harvesting. Library Trends, 53. Retrieved from http://hdl.handle.net/2142/1754

Open Archives Initiative Protocol for Metadata Harvesting (2001)
  • scalable solution for community metadata needs
  • implementation nonspecific
    • facilitate use in wide variety of institutions and domains
  • min. use: DC schema
    • other schemas possible
  • access to "invisible web" + aggregate sources from diff collections
  • 2 "entities" who use protocol
    • data providers, aka repositories 
    • service providers, aka harvesters
      • can build value-added services
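
A small sketch of the service provider (harvester) side, assuming a made-up repository at repo.example.org. The verb and metadataPrefix request parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the helper names and SAMPLE_RESPONSE (a trimmed response without the usual responseDate and request elements) are illustrative assumptions. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical, trimmed response from a data provider.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iter("{%s}record" % OAI_NS):
        identifier = record.findtext(
            "{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = record.findtext(".//{%s}title" % DC_NS)
        pairs.append((identifier, title))
    return pairs
```

A harvester would fetch list_records_url(...) from each data provider, run something like harvest_titles on the responses, and build its value-added services on the aggregated metadata.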

Current trends and developments
  • user group-specific service providers
  • diff comms develop diff standards in addition to protocol
  • Open Language Archives Community
    • language resources
  • Sheet Music Consortium
    • particular problem b/c of sheet music, cover art, lyrics, etc.
    • allows users to annotate metadata
  • National Science Dig Lib
    • OAI protocol primary means
    • build + aggregate collections and services/infrastructure to support activities

Shortcomings of existing registries
  • usu. very sparse recs about indiv. reps
  • no search mechanism
  • ltd browsing
  • few registries have complete list of all available reps

Developing experimental OAI registry (UIUC)
  • completeness
    • inventory of existing registries
    • following and exploring links
    • search Google for OAI reps
  • discoverability
    • allow for diff views w/o any manual cataloging of OAI reps
    • automation of data harvesting and indexing
  • machine processing
    • turn registry into OAI rep

Future work
  • for better search and discovery, enhance collection-level desc
  • increase in automated maintenance of registry
  • increase in automated discovery of other registries
  • delegate creation and maintenance of virtual collections, incl. metadata
  • improve view of search results (contextualization)

ERRoL resolution (Extensible Repository Resource Locators)
  • "cool URLs" (Berners-Lee) to content and services linked to info in OAI rep
  • OAI-id for item 

Challenges
  • data provider implementations
    • many potentially useful features underutilized
  • metadata
    • ways of using encoding standards differ
    • leads to diff relevance for users
    • ++ formats, ++ complex metadata
  • lack of communication b/w service and data providers

Future directions
  • development of best practices
  • Static Repository Gateway (Los Alamos Natl Lab)
    • low technical entry barrier
  • mod_oai project
    • accessible content from Apache open-source servers
  • OAI rights
    • means of structured lang w/in protocol
  • controlled vocabs
  • gateway to ERRoL service
