Meta, meet Data

Friday, December 5, 2014

Week 14: Security, Privacy, and Cloud Computing

1. O'Harrow, R. (2005). Chapter 10. In No Place to Hide: Behind the Scenes of Our Emerging Surveillance Society (281-300). New York: Free Press.

electronic surveillance: future of data collection
transit cards monitor traffic, travel activity
hand readers @ workplaces instead of traditional punch cards
GPS, CCTV
tollbooths as security points

e-toll credits to verify location

RFID (radio frequency id) @ heart of system

"virtual borders"

"why worry if you have nothing to hide"? --> awkward logic?
surveillance as defense/security ---> but v. what/who?

2. Jaeger, P., Lin, J., Grimes, J., & Simmons, S. (2009). Where is the cloud? Geography, economics, environment, and jurisdiction in cloud computing. First Monday, 14(5). http://firstmonday.org/ojs/index.php/fm/article/view/2456/2171

3. Library Data in the Cloud - National Information Standards Organization. (n.d.). Retrieved November 21, 2014, from http://www.niso.org/news/events/2014/virtual/data_in_the_cloud/

4. Cloud Computing Online Training. (2014, Mar 3) Learning Cloud Computing With Amazon Web Services What Is The Cloud. Retrieved from https://www.youtube.com/watch?v=Neys3rci14o

cloud computing: large data centers dynamic enough to scale resources up or down for users

functionality depends on size and continuity:
efficient flow of data

although not familiar w/ term or unaware of own use, many ppl already involved in it

ex. Gmail, Flickr

"cloud" not just physical machines

also raises policy issues

diff components

infrastructure

computational resources
storage
ex. Amazon Elastic Compute Cloud

platform

software stack
ex. Google App Engine

application

Web services running on top of cloud computing component

What is...?

cloud computing offers possible solutions to "Web-scale" challenges in processing data
commercialization of "utility computing" services and development

addtl revenues
consolidation: overall reduced costs

liberates users from maintaining infrastructure

Who uses...?

app hosting

cloud provider w/ maintenance tasks

batch processing

large amt of data

temporary use x existing IT infrastructure, aka cloud bursting

temporary/seasonal peaks

user data + apps in cloud cluster

owned and maintained by provider
legal issues?

Where is...?

centralization of info + countless computing resources
location of data centers a major issue: possibility of portable dc?

suitable physical space (at least warehouse-sized)
near high-capacity Internet connections
lots of affordable electricity/other energy resources
laws of jurisdiction

adjudication of cases?
govt intervention?
costs

Rules and policies

users expect reliable, high-speed 24/7 access
also secure and private connections
liability + intellectual property + ownership of data
easy transfer of data
for corporations: ability to be audited

Week 12: Muddiest points

1. I'm familiar with the concept of folksonomy as an active user of the photo-sharing site Flickr, but I'm wondering how extensive the use is as a supplement to the controlled vocabulary provided by other institutions, or whether adoption of what was before a folksonomic term depends on the frequency/popularity of that term.

Friday, November 21, 2014

Week 12: Web 2.0, Social Media, and Libraries

1. Kaplan, A. M., & Haenlein, M. (2010). Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons, 53(1), 59-68. doi: 10.1016/j.bushor.2009.09.003

social media popular but still unclear definition

difference from web 2.0 and user-generated content?
some cos. remain uncomfortable with "freer" customer/client interaction

less "control" on part of co.

but s.m. <> www as platform for exchanging info

form +++ powerful than 1970s BBS

what it is/n't

1979: Usenet (Duke)
1998: Open Diary, (we)blog
2000s: high-speed Internet access

03: MySpace
04: Facebook

2004: Web 2.0 = ideological + technical foundation

new way that software devs + users collab in www
content + apps continuously modified
basic fxnalities: Flash, RSS, AJAX (.js)

2005: User-generated content (UGC)

published on publicly-accessible site or social networking site

excludes emails/IMs

creative

excludes existing content

"amateur"

excludes commercial purpose

s.m. = Internet-based apps combining Web 2.0 + UGC

apps heterogeneous
but no systematic way s.m. apps can be categorized
possibly: "richness" of medium + degree of social presence

Challenges and opportunities of s.m.

collaborative projects

joint outcome may be better than individual efforts
wikis v. social bookmarking

blogs

usu. by 1 indiv., but can provide forum for interaxn
increasingly adopted by firms

content communities

media content between users
YouTube, Flickr, Slideshare
copyright??

social networking sites

personal info, but also brand communities
Facebook, MySpace

virtual game worlds

highest level of richness + social presence
World of Warcraft, Everquest

virtual social worlds

similar to game worlds, except no rules for possible interaxns
Second Life

Companies and social media

choose appropriate medium for purpose
select app or make own
ensure s.m. activities align w/ each other
also w/ firm's overall media strategy
access for all employees
stay active, interesting, humble, slightly informal, honest

2. Lankes, R. D., Silverstein, J., & Nicholson, S. (2007). Participatory networks: The library as conversation. American Library Association. Available at http://quartz.syr.edu/rdlankes/Publications/Others/ParticiaptoryNetworks.pdf

Libs in "convo business"

knowledge business --> "convo business"

ppl learn through convo
info lit + critical thinking
convo w/in individual: metacognition

how can web 2.0, social media further facilitate ideas traditionally provided by brick-and-mortar lib?
tech --> new possibilities for reaching ideals

Tech integration

usefulness of tech must be measured against lib. mission
social networks
wikis: mass decision-making
loosely coupled APIs (application programming interface)

"convo" b/w apps
Google Maps

mashups: ease of incorporation
permanent betas

Google Labs, MIT Libs

+++ users, improved software
folksonomies: UG classification

Core new tech: AJAX and Web services

AJAX: Asynchronous JavaScript and XML

browser < data > server w/o refreshing entire page
open-source, light programming skills

Web services

software-software interaxns
e.g. ISBN no. to search multiple catalogs
lightweight, aggregate for +++ fxnality
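
The ISBN example above can be sketched as a small aggregator. This is only an illustration: the catalog names, ISBNs, and holdings are hypothetical, and plain dictionaries stand in for what would really be requests to each catalog's Web service.

```python
# Hypothetical catalogs: each maps an ISBN to a holdings status.
# In practice each lookup would be a call to a separate Web service.
CATALOGS = {
    "campus_library": {"9780000000001": "available"},
    "public_library": {"9780000000001": "checked out",
                       "9780000000002": "available"},
    "consortium": {},
}

def search_by_isbn(isbn, catalogs):
    """Ask every catalog about one ISBN and aggregate the answers."""
    hits = {}
    for name, holdings in catalogs.items():
        if isbn in holdings:
            hits[name] = holdings[isbn]
    return hits

results = search_by_isbn("9780000000001", CATALOGS)
print(results)
```

One identifier goes in, and the combined answer from several sources comes out, which is the "aggregate for more functionality" idea in miniature.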

Library 2.0

which apps for which purposes? strategies?
choose appropriately for user participation
social networking sites

Participatory librarianship in axn

connect w/ constituencies and other institutions
Worldcat
informalize the catalog

enhance info provided
incorporate folksonomies

reference x community involvement

develop online knowledge base
offer + meeting spaces
+ access points
community repositories?

institutional, digital repositories

3. Salomon, D. (2013). Moving on from Facebook: Using Instagram to connect with undergraduates and engage in teaching and learning. College & Research Libraries News, 74(8), 408-412. Available at http://crln.acrl.org/content/74/8/408.full

Study at UCLA Powell Library

use of Instagram to reflect undergrad pop.

students doc. time in lib via app
even w/ low no. of followers @ beginning, + interactive than FB

Instagram 3rd most pop. in U.S.

still visual, but move away from text stimulation?

allow integration of lib activities and uni curriculum
social media: addtl factor for measuring impact on student success?
another way for lib to be engaged, to reject stereotypes of "stuffiness"?

Week 11: Muddiest points

1. The following question, I think, is beyond the scope of the class, but I will ask anyway: since I'm interested in audiovisual collections, I was wondering about the barriers not just to access and continued (or any) use, but also to funding and sustaining such materials. Here I am thinking about finding and then maintaining the equipment required for digitization, or even just playback.

2. Regarding institutional repositories, they seem mainly geared toward faculty, and even then, some faculty may not be interested in or aware of such a resource for their preprints, etc. I'm not quite sure whether there is the same push for students (especially, for instance, undergraduates working on their senior theses) to deposit their work in the IR.

Friday, November 14, 2014

Week 11: Digital library and web search

1. Paepcke, A., García-Molina, H., & Wesley, R. (2005). Dewey Meets Turing: Librarians, Computer Scientists, and the Digital Libraries Initiative. D-Lib Magazine, 11. Retrieved from http://www.dlib.org/dlib/july05/paepcke/07paepcke.html

NSF --> Digital Libraries Initiative (1994)

collaboration librarians x CSists

research x daily affairs, aka theory x practice
shared values

need to share w/ wider community
linkage of reliable info not just for "info pros" but also CS

Google one of many results
how to access, share funding?

misconceptions from both parties

"hubs" as new framework for collections online
connections b/w librarians <> scholarly authors

2. Lynch, C. A. (2003). Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. ARL Bimonthly Report, 226. Retrieved from http://www.arl.org/storage/documents/publications/arl-br-226.pdf

Institutional repositories (2002)

definition

provides services to uni community for mgmt and dissemination of digital mats.

work by both fac & students
research & teaching

stewardship of such mats.

also data

supported by diff. techs.

++ accountability for unis

++ active role in scholarly publishing
forging more strategic, mutually beneficial alliances

New patterns in access/dissemination

decrease in online storage costs
standards for metadata --> interop.

MIT DSpace x HP (2003)

model for other reps both in the U.S. and internationally
open-source software

esp. important for institutions w/ significantly lower endowments/resources

Strategic importance

near-term & long-term preservation of scholarly works, esp. by faculty
supplementary materials

preprints? "first access"

also affiliation w/ institution
what is worth collecting?
encourage faculty to use institution resources

complement to disciplinary repositories

Potential dangers

institutional control over intell. property
centralization (inst.) v. decentralization (discipline/dept)

risk of inappropriate policy constraints?

too fashionable?

hasty implementation w/o judging merits or sustained commitment?

Networked info standards and infrastructure

preservable formats
identifiers

persistent and consistent reference to mats.

rights doc. and mgmt

again, metadata
but also controlled vocab (?)

3. Hawking, D. (2006). How Things Work: Web Search Engines: Parts 1 and 2. IEEE Computer. Retrieved from http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf

Data processing

tools and interfaces have many of same data structures and algorithms in common
search engines can't/shouldn't index all pgs

b/c no. of pgs is infinite

more useful to

reject "low-value content"
ignore huge vols. of accessible data

Problems and techniques

multiple locations for data centers

redundancy helps tolerate faults
PC type depends on factors like price, speed, memory, physical size, etc.
clusters can target specialized functions

ex. crawling, indexing, replication

Crawling algorithms

queue of unvisited URLs

started by 1 or more "seed" URLs, then HTTP request
huge data structure required
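
A minimal sketch of the queue-based crawl described above, under stated assumptions: the tiny in-memory "web" and its URLs are made up, and a dictionary lookup stands in for the HTTP request. The `seen` set is the structure that grows huge at real scale.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would issue an HTTP request and parse the response instead.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def crawl(seeds, fetch_links):
    """Visit pages breadth-first, starting from the seed URLs."""
    queue = deque(seeds)   # unvisited URLs waiting to be fetched
    seen = set(seeds)      # every URL ever queued, so nothing is queued twice
    order = []             # URLs in the order they were fetched
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(visited)
```

Starting from one seed, the crawl reaches all four pages, each exactly once.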

real crawlers

different speeds
risk of server overload

only 1 req/server at a time
"politeness" delay b/w requests

excluded content

check site's robots.txt file
to see whether parts or all of site should be crawled
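
The robots.txt check can be tried with Python's standard `urllib.robotparser`. This is a sketch only: the robots.txt content, the site, and the crawler name "ToyCrawler" are invented, and `Crawl-delay` is a widely used extension rather than part of the original exclusion convention.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download this
# from the site's /robots.txt before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ToyCrawler" is a made-up user-agent name.
allowed = parser.can_fetch("ToyCrawler", "http://site.example/index.html")
blocked = parser.can_fetch("ToyCrawler", "http://site.example/private/x.html")
delay = parser.crawl_delay("ToyCrawler")  # "politeness" delay in seconds

print(allowed, blocked, delay)
```

A polite crawler would skip the blocked URL and wait `delay` seconds between requests to this server.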

duplicate content

unrecognized duplicates could be links to other duplicates
early detection necessary
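
One simple way to detect duplicates early is to fingerprint page text with a hash. This is an assumed technique for illustration, not necessarily the one the article describes: it only catches pages that are identical after whitespace and case normalization, and the URLs and page text are made up.

```python
import hashlib

# Hypothetical pages: two URLs serve the same text with trivial differences.
PAGES = {
    "http://a.example/page": "Hello   World",
    "http://a.example/page?ref=home": "hello world",
    "http://b.example/other": "Something else",
}

def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """Return (duplicate_url, original_url) pairs."""
    first_seen = {}
    duplicates = []
    for url, text in pages.items():
        fp = fingerprint(text)
        if fp in first_seen:
            duplicates.append((url, first_seen[fp]))
        else:
            first_seen[fp] = url
    return duplicates

dupes = find_duplicates(PAGES)
print(dupes)
```

Once a page is flagged, its outgoing links need not be followed, which keeps one duplicate from leading to more.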

continuous crawling

full crawls at fixed intervals might slow processing
instead install priority queue
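
A priority queue for the crawl frontier can be sketched with the standard `heapq` module. The URLs and priority numbers here are hypothetical; a real crawler would derive priorities from things like how often a page changes.

```python
import heapq

# Each entry is (priority, url); a lower number means "crawl sooner".
frontier = []
heapq.heappush(frontier, (3, "http://static.example/about"))
heapq.heappush(frontier, (1, "http://news.example/front-page"))
heapq.heappush(frontier, (2, "http://blog.example/latest"))

# Pop everything: URLs come out in priority order, not insertion order.
crawl_order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
print(crawl_order)
```

Fast-changing pages can be pushed back with a high priority after each visit, so they are revisited often without waiting for a full crawl.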

spam rejection

Indexing algorithms

use inverted files to rapidly find docs containing a term
2 phases

scan text of each doc
inversion (?)
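
The two phases can be sketched on a toy collection. The documents are made up and tokenizing is reduced to splitting on whitespace: phase 1 scans each doc and emits (term, doc id) pairs, and phase 2 (the inversion) sorts those pairs by term so that each term ends up with its own list of docs.

```python
from collections import defaultdict

# Hypothetical three-document collection.
DOCS = {
    1: "the cat sat",
    2: "the dog sat down",
    3: "a cat and a dog",
}

def build_inverted_index(docs):
    # Phase 1: scan the text of each doc, emitting (term, doc_id) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for term in text.lower().split():
            pairs.append((term, doc_id))

    # Phase 2: inversion. Sort pairs by term (then doc id) and group them,
    # giving one sorted postings list per term.
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(DOCS)
print(index["cat"], index["sat"])
```

The result maps terms to docs rather than docs to terms, hence "inverted".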

Real indexers

store addt'l info in postings

ex. term frequency, positions
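
A sketch of postings that carry extra information, here the word positions within each doc (term frequency is then just the number of positions). The documents and the nested-dictionary layout are assumptions for illustration.

```python
def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that doc]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

# Hypothetical documents.
DOCS = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(DOCS)

# Term frequency of "to" in doc 1 is the length of its position list.
tf_to_doc1 = len(index["to"][1])
print(index["to"], tf_to_doc1)
```

Positions make phrase queries possible: "to be" matches where a position of "be" is one more than a position of "to" in the same doc.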

scaling up

doc partitioning

term lookup
compression for key structures
precomputing for common phrases
indexing anchor text w/ target & source (?)

useful for descriptions

popularity score of pages

derived from frequency of incoming links
ex. PageRank

query-independent score

internal ranking
++ score, ++ retrieval probability
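
A link-based, query-independent score can be sketched with a simplified PageRank-style iteration. This is not the production algorithm: the four-page link graph is made up, and the damping factor of 0.85 and the 50 iterations are assumed values. Note that the score depends on the rank of the linking pages, not only on how many links come in.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C has the most incoming links, D has none.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(LINKS)
best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)
print(best, worst)
```

Because the scores are computed before any query arrives, they can be stored alongside the index and mixed into ranking at query time.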

Query-processing algorithms

most common type of query

avg length 2.3 words

return docs containing all query words
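
Returning docs that contain all query words amounts to intersecting postings lists. A sketch, assuming a made-up index with sorted doc ids:

```python
# Hypothetical inverted index: term -> sorted list of doc ids.
INDEX = {
    "cat": [1, 3, 7, 9],
    "sat": [1, 2, 7],
    "mat": [7, 9],
}

def intersect(a, b):
    """Merge two sorted postings lists, keeping doc ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, query):
    """Return doc ids containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

print(and_query(INDEX, "cat sat"))
print(and_query(INDEX, "cat sat mat"))
```

A word missing from the index has an empty postings list, so the whole query returns nothing.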

Real processors

simple-query processor usu. = poor results
increase in quality

scans to end and sorts lists by relevance
but too computationally time-consuming, expensive

Increasing speed

skipping
early termination

can stop processing after short scan

better assignment of doc numbers (??)
caching
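
Of the speed-ups listed, caching is the easiest to sketch. Here the standard `functools.lru_cache` remembers answers to repeated queries; the `search` function is a stand-in for real query processing, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive work actually runs

@lru_cache(maxsize=1024)
def search(query):
    """Stand-in for expensive query processing; results must be hashable."""
    CALLS["count"] += 1
    return tuple(sorted(query.lower().split()))

first = search("digital libraries")
second = search("digital libraries")  # served from the cache
print(CALLS["count"], search.cache_info().hits)
```

Popular queries repeat often, so even a small cache can absorb a large share of the load.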

4. Shreeves, S. L., Habing, T. G., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI Protocol for Metadata Harvesting. Library Trends, 53. Retrieved from http://hdl.handle.net/2142/1754

Open Archives Initiative Protocol for Metadata Harvesting (2001)

scalable solution for community metadata needs
implementation nonspecific

facilitate use in wide variety of institutions and domains

min. use: DC schema

other schemas possible

access to "invisible web" + aggregate sources from diff collections
2 "entities" who use protocol

data providers, aka repositories
service providers, aka harvesters

can build value-added services
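
A sketch of what a harvester (service provider) does with a repository (data provider). The `ListRecords` verb and the `oai_dc` metadata prefix come from the protocol; the base URL, the record, and the helper names are hypothetical, and the response is a hand-written sample rather than a real harvest.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.org/oai"  # hypothetical data provider

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request asking for Dublin Core metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

# Hand-written sample of what a repository might send back.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest_titles(xml_text):
    """Return (identifier, title) for each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results

url = list_records_url(BASE_URL)
records = harvest_titles(SAMPLE_RESPONSE)
print(url)
print(records)
```

The harvester only needs the base URL and the shared Dublin Core schema, which is why the protocol works across very different repositories.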

Current trends and developments

user group-specific service providers
diff comms develop diff standards in addition to protocol
Open Language Archives Community

language resources

Sheet Music Consortium

particular problem b/c of sheet music, cover art, lyrics, etc.
allows users to annotate metadata

National Science Dig Lib

OAI protocol primary means
build + aggregate collections and services/infrastructure to support activities

Shortcomings of existing registries

usu. very sparse recs about indiv. reps
no search mechanism
ltd browsing
few registries have complete list of all available reps

Developing experimental OAI registry (UIUC)

completeness

inventory of existing registries
following and exploring links
search Google for OAI reps

discoverability

allow for diff views w/o any manual cataloging of OAI reps
automation of data harvesting and indexing

machine processing

turn registry into OAI rep

Future work

for better search and discovery, enhance collection-level desc
increase in automated maintenance of registry
increase in automated discovery of other registries
delegate creation and maintenance of virtual collections, incl. metadata
improve view of search results (contextualization)

ERRoL resolution (Extensible Repository Resource Locators)

"cool URLs" (Berners-Lee) to content and services linked to info in OAI rep
OAI-id for item

Challenges

data provider implementations

many potentially useful features underutilized

metadata

ways of using encoding standards differ
leads to diff relevance for users
++ formats, ++ complex metadata

lack of communication b/w service and data providers

Future directions

development of best practices
Static Repository Gateway (Los Alamos Natl Lab)

low technical entry barrier

mod_oai project

accessible content from Apache open-source servers

OAI rights

means of structured lang w/in protocol

controlled vocabs
gateway to ERRoL service

Week 10: Muddiest points

1. Must XML attributes and elements always be quoted? In HTML, for example, one can code the link as:

<a href = http://www.url.com/>site</a>

or

<a href="http://www.url.com/">site</a>

2. What are some interoperability issues when using XML -- for instance, in using Unicode v. ASCII?
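
On the first question: XML requires attribute values to be enclosed in single or double quotes, while element names are never quoted. A quick check with Python's standard XML parser, using the two links from the question (the helper name is made up):

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

quoted = '<a href="http://www.url.com/">site</a>'
unquoted = '<a href=http://www.url.com/>site</a>'

print(is_well_formed(quoted))
print(is_well_formed(unquoted))
```

HTML parsers are more forgiving, which is why the unquoted form can work in a browser but fails as XML.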

Friday, November 7, 2014

Week 10: XML

1. Martin Bryan. Introducing the Extensible Markup Language (XML): http://www.is-thought.co.uk/xmlintro.htm
2. Extending Your Markup: An XML Tutorial by Andre Bergholz : http://xml.coverpages.org/BergholzTutorial.pdf
3. XML Schema Tutorial http://www.w3schools.com/Schema/default.asp

XML: subset of SGML (Standard Gen. Markup Lang.)

clearly mark boundaries of elements in DTD (Doc Type Def)

dec: <!DOCTYPE>
con: namespaces + DTD don't work well together

this delineation enforces strict implementation

ex. 1st-level heading implemented before 2nd-level, etc.

extends link capabilities w/ 3 supp. lang

XLink: links b/w 2 docs
XPointer: individual parts of XML doc
XPath: used by previous to describe loc paths

loc path: axis, node test, predicate
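
A location path can be tried with Python's `xml.etree.ElementTree`, which supports only a subset of XPath in abbreviated syntax. The catalog document is made up. In `./book[@lang='en']/title`, the axis is the implicit child axis, `book` and `title` are node tests, and `[@lang='en']` is the predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical document.
CATALOG = """
<catalog>
  <book lang="en"><title>First Title</title></book>
  <book lang="fr"><title>Second Title</title></book>
  <book lang="en"><title>Third Title</title></book>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Child axis (implicit), node test "book", predicate on the lang attribute.
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]
print(english_titles)
```

Dropping the predicate (`./book/title`) would select all three titles.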

XML not designed to be one standardized way of coding text

multiple files for compound docs

XML docs: formal syntax for series of entities

ea. entity can contain 1+ elements
ea. element can contain 1+ attributes (process)
3 types of markup

document instance (what kind)
optional: processing instruction (how to read)
optional: doc type declaration (formal markup declarations)

Use

markup tags (defined by trade org or other body)

e.g. <to> content </to>

possible to define own sets

create DTD w/ formal id of relationships b/w elements
and also define attributes
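
To see elements and attributes as a parser sees them, here is a sketch with a made-up `note` document. The tag set and the `priority` attribute are hypothetical, and no DTD is involved since Python's standard parser only checks well-formedness.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using a self-defined tag set.
NOTE = '<note priority="high"><to>Reader</to><from>Editor</from></note>'

note = ET.fromstring(NOTE)
child_tags = [child.tag for child in note]   # element names inside <note>
attributes = note.attrib                     # attributes on <note>
recipient = note.find("to").text             # content of <to> ... </to>

print(child_tags, attributes, recipient)
```

A DTD for this tag set would state that `note` contains `to` and `from`, and that `priority` is an attribute of `note`.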

Standard and non-standard text elements (??)

commonly used text: text entity
non-standard: system-dependent entities can be declared

Illustrations and other special elements

special notation either as entity or attribute
notation declaration

to designate action for unparsed data in ref file

XML schema

allows user to define data types
goal: to replace DTDs
4 schema proposals

DDML: doc def markup lang
DCD: doc content desc
SOX: schema for object-oriented XML
XML-Data (replaced by DCD)

Example

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.mypage.com/">

<xs:element name="content">
</xs:element>
</xs:schema>