OCR and Data Cleaning

Wednesday, March 11, 12:00-1:15 p.m.
Strozier Library R&D Commons (Ground Level)

Deep Learning, Dirty OCR, and the Humanist’s Ever-Changing Toolkit

Few, if any, humanities projects involving data acquisition or digital imaging can be done without some knowledge of Optical Character Recognition (OCR).  And yet OCR is itself a dynamic and changing application. Whether you are interested in data capture, data markup, corpus representativeness, or imaging capability — or, whether you are vaguely curious about the actual, social, or political implications of OCR on your teaching and research, and on the fate of scholarly and public collections — Digital Scholars’ next meeting will be of interest to you. We are pleased to welcome Dr. Allen Romano, Coordinator of FSU’s M.A. in the Digital Humanities, who will lead us in a hands-on exploration of “Dirty” OCR — a term often used to describe electronic forms or documents whose information has been inaccurately rendered.

This event is open — all disciplinary leanings and technical abilities are welcomed! Participants are invited to read the following in advance:

Participants are encouraged to bring laptops or tablets.

We hope you can join us,

-TSG

Post-meeting Resources: Dr. Romano has shared with us the GitHub repository he prepared for today's session. Browse to: https://github.com/allenjromano/dirtyocr/blob/master/dirtyocr.md

Oceanic Exchanges: Newspaper Corpora and Networks

Wednesday, February 12, 12:00-1:15 p.m.
WMS 415 (from 4th floor elevator, turn L then R)

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914

For data and digital humanists, observing transnational and transcontinental news circulation offers a keen reminder that “news flow” is as much a function of intimate rhizomatic accidents and technological imagination as it is of telegram networks and modal distribution. This is particularly true when the flow occurred without the explicit use of digital tools, though the affordances of now-digital historical methods help to illuminate these accidents and networks in detail. Digital Scholars is pleased to welcome two scholars, Jana Keck and Paul Fyfe, to share Oceanic Exchanges, a series of projects that work toward uncovering the hidden strategies responsible for promoting the transcontinental flow of information about people, places, and global events between 1840 and 1914. During their virtual visit, Keck and Fyfe will offer stories of its exigence and development, and offer glimpses into how it is designed to aggregate, in new ways, the vast but disparate linked open data that occurs in extant sources, such as Chronicling America and The Times Digital Archive. Among the many remarkable features of Oceanic Exchanges is its transcontinental construction. Led by Ryan Cordell and Lara Rose, and established to be an accomplished research collective, Oceanic Exchanges boasts a research team of scholars from seven countries in Europe and the Americas, and represents funded support from six national agencies.

Participants are encouraged to bring electronic tablets or laptops, and to read and browse the following resources in advance:

We hope you can join us,

-TSG

 

Collecting Irregular Data on Medieval Manuscripts: “The Tremulator” Four Years Later

Friday, January 31, 12:00-1:15 p.m.
Strozier Library R&D Commons (Ground Level)

“The Tremulator,” Four Years Later

Four years ago this month, Dr. David Johnson presented Digital Scholars with a paleographic tool still under development: “The Tremulator.” Nicknamed after the intricate “layering” of glossed manuscripts in the Middle Ages (such as those produced by the “Tremulous Hand of Worcester” in 13th-century England), this tool was remarkable in two ways: (1) It enabled paleographers to perform scrutinous analysis of medieval inscriptions on something as accessible as a touch-screen device; and (2) it enabled a kind of crowd-sourced cataloguing and visualizing of translative data, especially capturing their various signs of use. As the first speaker in our series on “Using the Humanist’s Tools,” Dr. Johnson will discuss and demonstrate the Tremulator in its current iteration, offering insight into what developers call the “server-side” or “back-end” functions of the tool. Participants are encouraged to bring electronic tablets or laptops, and to browse the following resources in advance:

  • Johnson, David F. (2019). The Micro-Texts of the Tremulous Hand of Worcester: Genesis of a Vernacular liber exemplorum. In Ursula Lenker, Lucia Kornexl (Eds.), Anglo-Saxon Micro-Texts (pp. 225-266). Berlin, Boston: De Gruyter. https://doi.org/10.1515/9783110630961-012 [stable copy in Canvas org site]
  • Thorpe, Deborah E., and Jane E. Alty (2015). What type of tremor did the medieval ‘Tremulous Hand of Worcester’ have? Brain: A Journal of Neurology, vol. 138, no. 10, pp. 3123-27. (open-access at Oxford Journals http://brain.oxfordjournals.org/content/138/10/3123)

We hope you can join us,

-TSG

 

Organizational Meeting: Using the Humanist’s Tools

Friday, January 17, 12:00-1:15 p.m.
Williams 415 [immediate L off elevators, then R down hall to seminar room]

An Introduction to “Using the Humanist’s Tools”

For our first meeting of Spring 2020, we will identify lingering and observable tensions between institutional outcomes and institutional value where the humanities’ involvement in digital scholarship is concerned. We will do so by discussing three different proposals for achieving humanistic inquiry through appropriations of data: Christina Boyles’s 2018 argument for social-justice data curation as an intersectional approach to the digital humanities; Stephen Ramsay and Geoffrey Rockwell’s 2012 argument for a materialist ideology that demonstrates “building things” as legitimate theoretical work; and Lev Manovich’s 1998 argument for the database as an appropriately postmodern logic that harnesses the aesthetic capacities and technical motivations of Web 2.0.

These proposals are, by now, familiar and well circulating for many scholars and teachers of the digital humanities and related fields, yet publishing trends in the humanities show them to be largely unrealized at the institutional level. When we meet, we’ll question these as-yet unrealized goals. Do the proposals languish only within institutions that value external stakes more highly than internal outcomes (i.e., privileging big-data representations, tool development, and high-tech market applications over small-scale data representations or exploratory critical work)? Do they languish as a result of new (or recurring) systemic disagreements about the efficacy of materialist work? Or do they reflect more deeply embedded and conflicting assumptions about what is real in DH research?

While the January 17 meeting is primarily for graduate students enrolled in or regularly attending the group, all Digital Scholars participants are welcome to read and join us for conversation on any of the following:

Participants are encouraged to bring laptops or tablets. We hope you can join us.
-TSG

Using the Humanist’s Tools: Spring 2020 Digital Scholars

Dear Friends of Digital Scholars,

I’m pleased to announce our schedule of topics and speakers for the culminating semester of Digital Scholars, on “using the humanist’s tools,” with all sessions inviting hands-on participation or offering a look into the architecture of particular projects. Please mark your calendars for the following dates:

Friday, Jan. 17, 2020
Organizational Meeting
12:00-1:15 p.m. (WMS 415)

Friday, Jan. 31, 2020
Collecting Irregular Data in Medieval Manuscripts, “The Tremulator,” with David Johnson
12:00-1:15 p.m. (tentatively Strozier Library R&D Commons, ground level)

Wednesday, Feb. 12, 2020
Digitized newspaper corpora and networks, “Oceanic Exchanges,” with Jana Keck and Paul Fyfe [via Zoom]
12:00-1:15 p.m. (WMS 415)

Wednesday, Mar. 11, 2020
Data cleaning for the humanities, “Dirty” OCR Analysis, with Allen Romano
12:00-1:15 p.m. (tentatively Strozier Library R&D Commons, ground level)

Friday, Apr. 3, 2020
Crowd-sourcing cultural citings/sightings, “Dante Today,” with Beth Coggeshall
12:00-1:15 p.m. (WMS 415)

More announcements will follow. We hope you can join us for one or more of these discussions in the spring.
-TSG

How Private is Private?

[Ellie Marvin is a master’s student enrolled in the Digital Scholars reading group this semester.]

Today, I opened the Twitter app and was greeted with a small banner notifying me of upcoming changes to Twitter’s Terms and Conditions. An updated version of their terms will go into effect on January 1, 2020. I quickly dismissed the banner, swiping away to see the content I opened the app to see. After watching the most recent Digital Scholars webinar, however, I decided to investigate further.

During the webinar, Yuwei Lin discussed a recent project in which she asked her students to record themselves asking whether people had read the Terms and Conditions for many of the apps and devices they use every day. Unsurprisingly, most people confessed they had not read these often long and jargon-filled documents. Anaïs Nony later brought up the idea of the ubiquitous and deceptive “feeling of consent” which we tend to engage in as a society. We allow ourselves to feel as if we’ve consented to certain kinds of surveillance without fully considering the consequences and how far-reaching that surveillance may be. This blind and blissful ignorance lulls us into a false sense of feeling as though we have control over our data, despite rarely actually looking into where it goes and who owns it.

Twitter has historically been an important social media platform for the growth and development of digital humanities. Twitter is often used in a digital humanities context to spread important academic information, and also to rapidly and collaboratively disseminate and create knowledge. Since Twitter is such an important tool in my field, I feel compelled to use it—even if only to browse other users’ tweets—and should understand what data the app is tracking.

Thus, I decided to read Twitter’s new Terms and Conditions. The terms were easy to find and displayed in large text. There’s an air of openness to Twitter’s Terms and Conditions and its Privacy Policy. Twitter’s Privacy Policy boasts in a large font, “We believe you should always know what data we collect from you and how we use it, and that you should have meaningful control over both.” However, when one delves a bit deeper, it seems clear that there is, in fact, no real privacy on Twitter—which, I suppose, should not come as a shock.

I was a bit upset (yet, still not surprised) to learn about how much data Twitter takes from me and all of its users. I do not like that it claims absolutely no responsibility for content its users post or any fallout from that content. I also do not appreciate the fact that, while Twitter takes no responsibility for this content, it is also able to remove content. Not only that, but Twitter retains a “worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute” any content posted on their site. This is a scary thought and an unpleasant one to have to consider.

One nice thing about Twitter, I will say, is its openness about advertising and the data which it will receive. I discovered a page which each logged-in user can access. The page will show users what data Twitter has gathered from them and what kind of advertisements have been tailored to them. The best part about this feature is that users have the option to turn it off. At any point, I can decide I would not like to have targeted ads and can simply subscribe to the same ads every other generic Twitter user could see.

It seems obvious to me, having now read through Twitter’s rules, terms and conditions, and privacy policy, that nothing on Twitter is either private or protected. Therefore, should digital humanists migrate to a new social media platform? Should we refrain from Twitter altogether in the search for something more private? Or is privacy simply a right which we have to allow ourselves to give up in order to engage with a global community?

Webinar: Data Surveillance

With an upsurge in attention toward veillance and transparency practices since Edward Snowden’s 2013 interviews published by The Guardian, public conversations of data surveillance have lately centered on racist and cultural critique. Please join us for our final webinar in the continuing series on “People in Data II,” open to any members of the FSU, FAMU, and TCC communities, as well as greater Tallahassee, the state of Florida, and beyond. This discussion will focus on several aspects of surveillance, from sousveillance alternatives (Steve Mann, 2005) to technological supremacy.

********

WEBINAR: Friday, November 22 – 12:00-1:30 p.m. EST
“Data Surveillance” featuring

  • Yuwei Lin, University of Roehampton
  • Anaïs Nony, University of Fort Hare

Advance Reading or Browsing
Participants are invited to read the following:

and to browse the following in advance:

Registration
All participants are requested to register at https://app.livestorm.co/florida-state-university-2.

Attending and Connecting
Webinar participants in Tallahassee are welcome to join us in person in the R&D Commons, basement level of Strozier Library, or to connect remotely via Livestorm. Through the interactive features of our Livestorm platform, all participants will have the opportunity to submit questions and participate in group chat.

Connection Requirements
Remote participants should ensure or secure the following:

  • Web browser (Edge, Chrome, Firefox, Safari version 10 or greater)
  • Adobe Flash Player version 10.1 or greater
  • Internal or external speaker
  • (recommended: headsets or earbuds for optimum sound)

Connection Troubleshooting
If your email host runs Proofpoint, you may experience some difficulty with the email-based link/button that Livestorm sends you to access the webinar. Should this happen, you can still access the webinar by copying and pasting the webinar URL into your web browser, rather than clicking the link/button.

This webinar is made possible through the generous support of FSU’s Office of Research.

We hope you can join us,
— Tarez Graban