Meta, meet Data: Week 11: Digital library and web search

1. Paepcke, A., García-Molina, H., & Wesley, R. (2005). Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative. D-Lib Magazine, 11. Retrieved from http://www.dlib.org/dlib/july05/paepcke/07paepcke.html

NSF --> Digital Libraries Initiative (1994)

collaboration librarians x CSists

research x daily affairs, aka theory x practice
shared values

need to share w/ wider community
linkage of reliable info not just for "info pros" but also CS

Google one of many results
how to access, share funding?

misconceptions from both parties

"hubs" as new framework for collections online
connections b/w librarians <> scholarly authors

2. Lynch, C. A. (2003). Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. ARL: A Bimonthly Report, 226. Retrieved from http://www.arl.org/storage/documents/publications/arl-br-226.pdf

Institutional repositories (2002)

definition

provides services to uni community for mgmt and dissemination of digital mats.

work by both fac & students
research & teaching

stewardship of such mats.

also data

supported by diff. techs.

++ accountability for unis

++ active role in scholarly publishing
forging more strategic, mutually beneficial alliances

New patterns in access/dissemination

decrease in online storage costs
standards for metadata --> interop.

MIT DSpace x HP (2003)

model for other reps both in the U.S. and internationally
open-source software

esp. important for institutions w/ significantly lower endowments/resources

Strategic importance

near-term & long-term preservation of scholarly works, esp. by faculty
supplementary materials

preprints? "first access"

also affiliation w/ institution
what is worth collecting?
encourage faculty to use institution resources

complement to disciplinary repositories

Potential dangers

institutional control over intell. property
centralization (inst.) v. decentralization (discipline/dept)

risk of inappropriate policy constraints?

too fashionable?

hasty implementation w/o judging merits or sustained commitment?

Networked info standards and infrastructure

preservable formats
identifiers

persistent and consistent reference to mats.

rights doc. and mgmt

again, metadata
but also controlled vocab (?)

3. Hawking, D. (2006). How Things Work: Web Search Engines: Parts 1 and 2. IEEE Computer. Retrieved from http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf

Data processing

tools and interfaces have many of same data structures and algorithms in common
search engines can't/shouldn't index all pgs

b/c no. of pgs is infinite

more useful to

reject "low-value content"
ignore huge vols. of accessible data

Problems and techniques

multiple locations for data centers

redundancy helps tolerate faults
choice of PC type depends on factors like price, speed, memory, physical size, etc.
clusters can target specialized functions

ex. crawling, indexing, replication

Crawling algorithms

queue of unvisited URLs

started by 1 or more "seed" URLs, then HTTP request
huge data structure required
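
A minimal sketch of the queue-based crawl described above. The toy in-memory "web" (`TOY_WEB`) and the `fetch_links` helper are made up for illustration and stand in for the real HTTP request and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real crawler would issue an HTTP request and parse the HTML instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for 'HTTP request + link extraction'."""
    return TOY_WEB.get(url, [])


def crawl(seeds, max_pages=100):
    """Breadth-first crawl: pop a URL, fetch it, enqueue unseen links."""
    queue = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)     # every URL ever queued (the "huge data structure")
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

The `seen` set is what grows huge at web scale: it has to remember every URL ever queued so that nothing is fetched twice.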

real crawlers

different speeds
risk of server overload

only 1 request per server at a time
"politeness" delay b/w requests

excluded content

check site's robots.txt file
to see whether parts or all of site should be crawled
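
A rough sketch of the politeness delay and robots.txt check, using Python's standard `urllib.robotparser`. The robots.txt body, the crawler name `NotesBot`, and the 2-second delay are assumptions for illustration, not values from the article.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download
# http://<host>/robots.txt before requesting anything else from that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

AGENT = "NotesBot"  # made-up crawler name


def make_checker(robots_body):
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, delay_seconds=2.0):  # delay value is an assumption
        self.delay = delay_seconds
        self.next_ok = {}  # host -> earliest allowed request time

    def wait_needed(self, url, now):
        """Seconds to wait before requesting url at time `now`."""
        host = urlparse(url).netloc
        return max(0.0, self.next_ok.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request to url's host was made at time `now`."""
        host = urlparse(url).netloc
        self.next_ok[host] = now + self.delay
```

Usage: call `make_checker(ROBOTS_TXT).can_fetch(AGENT, url)` before queuing a URL, and ask the gate how long to wait before each request to the same host.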

duplicate content

unrecognized duplicates usu. contain links to whole families of further duplicates
early detection necessary
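
One simple way to catch duplicates early is to fingerprint page text and compare hashes. This is only a sketch: exact-match hashing after normalisation is my assumption, and it will miss near-duplicates.

```python
import hashlib


def fingerprint(page_text):
    """Hash of whitespace-normalised, lower-cased text."""
    normalised = " ".join(page_text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers content fingerprints so repeated pages can be skipped."""

    def __init__(self):
        self.seen = {}  # fingerprint -> first URL seen with that content

    def check(self, url, page_text):
        """Return the URL of an earlier identical page, or None if new."""
        fp = fingerprint(page_text)
        if fp in self.seen:
            return self.seen[fp]
        self.seen[fp] = url
        return None
```

If `check` returns a URL, the crawler can skip extracting links from the page, which keeps one duplicate from dragging in the rest of its family.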

continuous crawling

full crawls at fixed intervals = slow response to important changes on the web
instead keep URLs in a priority queue, ordered by how soon each should be (re)crawled
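
A small sketch of a priority-queue crawl frontier built on `heapq`. The class name and the convention "lower number = crawl sooner" are assumptions for illustration; how priorities get computed is left out.

```python
import heapq
import itertools


class CrawlFrontier:
    """URLs ordered by priority; lower number = crawl sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        """Remove and return the most urgent URL."""
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

A frequently changing news page could be re-pushed with a small priority number after each visit, while a static archive page gets a large one.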

spam rejection

Indexing algorithms

use inverted files to rapidly find the docs that contain a given term
2 phases

scan text of each doc
inversion: sort the postings from the scan by term, then by doc number

Real indexers

store addt'l info in postings

ex. term frequency, positions
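
The scan and inversion phases, with positions kept in the postings, can be sketched roughly as below. Tokenising on whitespace and holding everything in memory are simplifying assumptions; a real indexer works through temporary files.

```python
from collections import defaultdict


def scan(docs):
    """Phase 1: emit one (term, doc_number, position) posting per word."""
    postings = []
    for doc_number, text in enumerate(docs):
        for position, term in enumerate(text.lower().split()):
            postings.append((term, doc_number, position))
    return postings


def invert(postings):
    """Phase 2: sort by term, then doc number, and group into lists."""
    index = defaultdict(dict)  # term -> {doc_number: [positions]}
    for term, doc_number, position in sorted(postings):
        index[term].setdefault(doc_number, []).append(position)
    return dict(index)


def term_frequency(index, term, doc_number):
    """How often term occurs in the given doc (0 if absent)."""
    return len(index.get(term, {}).get(doc_number, []))
```

For example, `invert(scan(["the cat sat", "the cat and the hat"]))["the"]` gives `{0: [0], 1: [0, 3]}`: the term occurs once in doc 0 and twice in doc 1.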

scaling up

doc partitioning

term lookup
compression for key structures
precomputing for common phrases
indexing anchor text w/ target & source (?)

useful for descriptions

popularity score of pages

derived from frequency of incoming links
ex. PageRank

query-independent score

internal ranking
++ score, ++ retrieval probability
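
A sketch of a link-based popularity score in the PageRank style, computed by simple power iteration. The damping factor 0.85 and the iteration count are conventional values I am assuming, not figures from the article, and this is not how any particular engine computes its score.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Split this page's score evenly over its outgoing links.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for target in pages:
                    new[target] += damping * score[page] / n
        score = new
    return score
```

The score depends only on the link graph, not on any query, which is why it can be computed ahead of time and stored with the index.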

Query-processing algorithms

most common type of query: a few words, no operators

avg length 2.3 words

return docs containing all query words
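
Returning the docs that contain all query words amounts to intersecting the words' postings lists. A minimal sketch, assuming each list holds doc numbers in ascending order (the index passed in is hypothetical):

```python
def intersect(list_a, list_b):
    """Merge two postings lists that are sorted by doc number."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, query):
    """Return doc numbers containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result
```

A word missing from the index yields an empty list, so the whole query returns nothing.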

Real processors

simple-query processor usu. = poor results
increase in quality

scans to end and sorts lists by relevance
but too computationally time-consuming, expensive

Increasing speed

skipping
early termination

can stop processing after short scan

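better assignment of doc numbers: number docs by decreasing query-independent score, so the best docs sit at the front of each postings list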
caching
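
A rough sketch of early termination and caching together. It assumes doc numbers are assigned in order of decreasing query-independent score, so the first k matches found are also the best-scoring ones; the index contents, `k`, and the cache size are made up.

```python
from functools import lru_cache

# Hypothetical index. Doc numbers are assumed to be assigned in order of
# decreasing query-independent score, so doc 0 is the "best" page.
INDEX = {
    "web": (0, 1, 3, 4, 6, 8, 9),
    "search": (1, 2, 3, 6, 7, 9),
}


def top_k(terms, k):
    """Find docs containing all terms, stopping once k matches are found."""
    lists = [INDEX.get(t, ()) for t in terms]
    if not lists:
        return []
    # Sets are used for brevity; a real engine would walk the lists in step.
    others = [set(postings) for postings in lists[1:]]
    hits = []
    for doc in lists[0]:
        if all(doc in other for other in others):
            hits.append(doc)
            if len(hits) == k:  # early termination: skip the rest of the list
                break
    return hits


@lru_cache(maxsize=1024)
def cached_query(query, k=2):
    """Cache results for repeated queries (cache size is arbitrary)."""
    return tuple(top_k(tuple(query.lower().split()), k))
```

Repeating `cached_query("web search")` is answered from the cache without touching the postings lists again.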

4. Shreeves, S. L., Habing, T. G., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI Protocol for Metadata Harvesting. Library Trends, 53. Retrieved from http://hdl.handle.net/2142/1754

Open Archives Initiative Protocol for Metadata Harvesting (2001)

scalable solution for community metadata needs
implementation nonspecific

facilitate use in wide variety of institutions and domains

min. requirement: unqualified Dublin Core (oai_dc) schema

other schemas possible

access to "invisible web" + aggregate sources from diff collections
2 "entities" who use protocol

data providers, aka repositories
service providers, aka harvesters

can build value-added services
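
A sketch of what a harvester does: build a `ListRecords` request against a data provider, then pull identifiers and Dublin Core titles out of the XML response. The base URL and the sample response are invented for illustration; the verb, the `metadataPrefix` parameter, and the namespaces follow OAI-PMH 2.0 as I understand it. No network call is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Made-up response fragment, trimmed to the parts used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    out = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.find(OAI_NS + "header/" + OAI_NS + "identifier").text
        title = record.find(".//" + DC_NS + "title")
        out.append((identifier, title.text if title is not None else None))
    return out
```

A service provider would run this against many repositories and index the combined records, which is where the value-added services come from.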

Current trends and developments

user group-specific service providers
diff comms develop diff standards in addition to protocol
Open Language Archives Community

language resources

Sheet Music Consortium

particular problem b/c of sheet music, cover art, lyrics, etc.
allows users to annotate metadata

National Science Dig Lib

OAI protocol primary means
build + aggregate collections and services/infrastructure to support activities

Shortcomings of existing registries

usu. very sparse recs about indiv. reps
no search mechanism
ltd browsing
few registries have complete list of all available reps

Developing experimental OAI registry (UIUC)

completeness

inventory of existing registries
following and exploring links
search Google for OAI reps

discoverability

allow for diff views w/o any manual cataloging of OAI reps
automation of data harvesting and indexing

machine processing

turn registry into OAI rep

Future work

for better search and discovery, enhance collection-level desc
increase in automated maintenance of registry
increase in automated discovery of other registries
delegate creation and maintenance of virtual collections, incl. metadata
improve view of search results (contextualization)

ERRoL resolution (Extensible Repository Resource Locators)

"cool URLs" (Berners-Lee) to content and services linked to info in OAI rep
OAI-id for item

Challenges

data provider implementations

many potentially useful features underutilized

metadata

ways of using encoding standards differ
leads to diff relevance for users
++ formats, ++ complex metadata

lack of communication b/w service and data providers

Future directions

development of best practices
Static Repository Gateway (Los Alamos Natl Lab)

low technical entry barrier

mod_oai project

exposes content on Apache open-source web servers via the OAI protocol

OAI-rights

structured rights statements expressed w/in protocol

controlled vocabs
gateway to ERRoL service


Friday, November 14, 2014
