“Early Judaism and Modern Technology”

Todd R. Hanneken, St. Mary’s University

for Early Judaism and Its Modern Interpreters, Matthias Henze and Rodney A. Werline, eds. Atlanta: Society of Biblical Literature, forthcoming.

Version History:

  1. October 22, 2018: first draft, this file
  2. December 7, 2018: second draft (LINK)
  3. December 31, 2018: deadline for substantial changes by author (LINK)

The most dramatic development in Early Judaism research over recent decades has been the expansion of digital technology. Computer-aided discovery grew from a small niche using punch cards in the 1960s to near universality. Tasks that were possible with paper, pen, and typewriter became increasingly quick and easy. Tasks that required processing of large data sets beyond human comprehension became possible. By “digital” we mean that information is stored, transmitted, and processed as a series of numbers, ultimately ones and zeros in binary code. Some of the advantages of digital technology mirror the changes in scholarship that came with the printing press and affordable paper. Like the printing press (and more so), digital technology can create exact duplicates of information. Unlike analog duplicates, each digital copy is identical to the original, no matter how many copies are made. Like the spread of affordable paper, digital information can be stored and transmitted at relatively low cost. Optical media, such as CD-ROM and DVD-ROM, rose above magnetic media for their low cost and were in turn replaced by magnetic and electronic media with higher capacity. More importantly, the transmission of digital information became quick, easy, and relatively affordable with the spread of standards known collectively as the Internet.

Rudimentary uses of digital technology in Early Judaism research can be thought of as quicker, easier, and cheaper versions of pre-digital technologies, such as paper. One trend of recent decades has been increased exploitation of the nature of digital information not only for storage and transmission but also for processing. Once information is “machine readable” it becomes more than a conduit of “human readable” information. The machine can find and transform information in ways that would otherwise be impossible or extremely time consuming. Digitization, or making information machine readable, occurs at many levels of abstraction. A page of a book can be digitized at the basic level of an image of the page, with black and white dots representing ink and paper. That information can be stored, transmitted, and presented to another human who may understand it, but the machines themselves have no greater understanding of the content than did the paper. The next level of abstraction is to digitize the text on the page, not just as black and white dots, but encoded as characters in an alphabet. This encoding can be done by human data entry, or through a form of machine learning called Optical Character Recognition (OCR). (The encoding of non-Latin alphabetic characters is another development, discussed below.) At this level of machine understanding the text can be searched for text strings, although inexact matches or matches that span lines of text require an additional level of machine understanding. Higher levels of abstraction, easy for an informed human reader, require additional human encoding or machine learning. Humans easily distinguish whether italics indicate the title of a book or journal, a word in a foreign language, or emphasis. We distinguish a series of capital letters as an acronym or a Roman numeral, and easily equate different standards for citation. Other levels of data about the data on the page (metadata) might include the language and catalog information of the work in which the page is found. Recent decades have seen significant advances in digital technology, moving from a “dumb” to a “smart” medium through metadata standards, human encoding, and machine learning. Nevertheless, awareness of the challenges and levels of abstraction of machine learning can help the researcher troubleshoot problems. For example, a search for “Is 40:5” may not find a reference to “Isa XL.5.” A search for a word with an “m” may fail if the optical character recognition read “rn” (and failed to detect the language from context, and that the word with “m” is a dictionary word in that language). Machine understanding of information in context is a trend in artificial intelligence applied to Early Judaism research, but it cannot yet be taken for granted.
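Such mismatches can be mitigated with deliberate normalization. The following sketch (in Python) is a deliberately minimal and hypothetical illustration covering only Isaiah; a real tool would need tables for every book, abbreviation, and numbering convention:

```python
import re

def roman_to_int(s):
    """Convert a Roman numeral (standard subtractive notation) to an integer."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    for i, ch in enumerate(s):
        v = values[ch]
        # Subtract when a smaller numeral precedes a larger one (e.g., XL = 40).
        total += -v if i + 1 < len(s) and values[s[i + 1]] > v else v
    return total

def normalize(ref):
    """Reduce 'Is 40:5', 'Isa XL.5', etc. to a canonical ('Isa', 40, 5)."""
    m = re.match(r"(?i)\s*Isa?\.?\s+([0-9]+|[IVXLCDM]+)[.:]\s*([0-9]+)", ref)
    if not m:
        return None
    chapter = int(m.group(1)) if m.group(1).isdigit() else roman_to_int(m.group(1).upper())
    return ("Isa", chapter, int(m.group(2)))

assert normalize("Is 40:5") == normalize("Isa XL.5") == ("Isa", 40, 5)
```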

Another general trend in digital technology in Early Judaism research has been progress from proprietary and closed tools to open and interoperable standards. The term “silo” is applied to a software application or website that may be very powerful in itself, but unable to share or receive information from outside sources. In decades past even the simple ability to copy and paste text from a Bible program to a word processor could not be taken for granted. In general this kind of problem occurs when there is no standard for encoding and transmitting information, or the standard is not followed. Many application developers find it easier to reach short-term goals by inventing their own system, rather than adopting a system understood by other applications. The advantages of interoperable standards apply at many levels, including image repositories, textual analysis, and bibliographic data. A simple example can be seen in the development of the encoding of Hebrew, ultimately leading to Unicode. Hebrew posed challenges mainly in that the alphabet is non-Latin and the direction is right-to-left, with further problems arising from Masoretic pointing. Early systems relied on some degree of transliteration, but were neither standardized nor machine readable. The system best designed for machine processing was Beta Code, which would render אחר as “)XR”. Systems designed to make word processors display something that looked like Aramaic square script were not standardized and relied on tricks with fonts. A font could be designed such that the character “)” or “a” looked like א, but the computer system had no understanding that the language and script were other than English. The user had to type backwards, manually manage line breaks, and tell the spell checker to ignore rHa for אחר. A better solution, though rarely used for Hebrew outside of Israel, was to use an alternative character set. An 8-bit character set can encode 256 distinct characters. Some of those could be assigned to Hebrew letters, but support for additional character sets was limited. The ultimate solution was the development of the Unicode standard, which assigns every character in every supported writing system its own code point (more than a million are possible) without tricking an “a” into looking like an aleph or alpha. Researchers today are unlikely to encounter problems with character sets unless working with digital materials from before the turn of the century (in which case further reading about ASCII, ANSI, Unicode, UTF-8, ISO-8859, and Windows-1252 might be helpful). Unicode also provides signals for text direction, i.e., switching between right-to-left (RTL) and left-to-right (LTR). In this case the existence of a standard, and general compliance with it, does not guarantee that there will be no problems across different implementations. Problems with multi-line right-to-left text in otherwise left-to-right paragraphs in Microsoft Word for Macintosh persisted long after standards existed to solve them. Other standards deal with much more complicated problems. When successful, standards for interoperability make it possible to aggregate data from many sources for search, processing, and visualization. Again, progress over recent decades is remarkable, but when troubleshooting or identifying limitations in research methods it is often helpful to understand the underlying standards for interoperability.
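A brief illustration (Python) of what the standard provides: each Hebrew letter has its own code point independent of any font, text is stored in logical reading order, and explicit control characters are available for direction in mixed text:

```python
word = "אחר"  # aleph, het, resh, stored in logical (reading) order

for ch in word:
    print(f"U+{ord(ch):04X}")    # U+05D0, U+05D7, U+05E8
print(word.encode("utf-8"))      # b'\xd7\x90\xd7\x97\xd7\xa8', two bytes per letter

# Unicode defines direction marks for mixed right-to-left and left-to-right text:
RLM, LRM = "\u200F", "\u200E"    # right-to-left mark, left-to-right mark
print(f"The word {RLM}{word}{LRM} appears in an English sentence.")
```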

Specific tools for Early Judaism research are discussed below in the categories of (1) primary sources search and access, (2) secondary sources search and access, (3) images of manuscripts and artifacts, (4) data visualization, and (5) publication and access.

Primary sources search and access

Digital collections of primary sources are widely available and typically divided by language and corpus. Resources are further divisible into those that are freely available and those that require purchase or subscription. With some notable exceptions, such as projects funded by universities and grants, resources freely available on the Internet often use editions and translations that are in the public domain and out of date. Software packages and subscription services can be expensive for individuals, especially those working in multiple corpora. Research universities typically provide access to visitors physically on campus.

Digital resources are most bountiful for the biblical canon, particularly the Protestant canon. These platforms have expanded to include additional corpora, such as the Pseudepigrapha, Philo, and Josephus, as well as the ability to create “custom” versions. Web-based resources such as BibleGateway.com (free, ad supported) offer many translations and simple searching. Locally installed software such as Logos and Accordance (and BibleWorks until it closed in 2018) offers substantially more power, including search by morphology and instant access to parsing and lexicons. Additional resources are often included or available as upgrade packages (e.g., maps, commentaries, and dictionaries).

For Greco-Roman materials, the Perseus Digital Library at Tufts University is an early star among digital humanities projects, having originated in 1985. Texts in Greek and Latin are linked to morphological information, and forms can be entered to show possible and likely parsings and lexicon entries. A related project, Perseids, uses open standards to build editions of ancient documents. Alpheios provides tools for philological analysis. Pelagios extends the principles of Linked Open Data with a focus on geography in the ancient world. These projects originated with a focus on Greek and Latin and expanded to the classical Mediterranean world. Because they utilize open standards, inclusion of Hebrew and Aramaic materials is easily imaginable. Another free, web-based resource is the Online Critical Pseudepigrapha. Among resources that require a subscription for full access, the Thesaurus Linguae Graecae (TLG) at the University of California, Irvine is the oldest (1972) and most comprehensive. An abridged collection and lexica are available with free registration. The Loeb Classical Library at Harvard University is also available by subscription in a searchable digital format.

Electronic resources for the Dead Sea Scrolls are available as optional additions to some of the Bible software packages described above. The most powerful dedicated tool is the Dead Sea Scrolls Electronic Library (DSSEL), published by Brill and Brigham Young University. The transcriptions and English translations are fully searchable and linked to Palestine Antiquities Museum images, though not necessarily the best available images (for which see below). The DSSEL was published as a specialized application on CD-ROM in 1999 (biblical) and 2006 (non-biblical), and converted to BrillOnline Reference Works in 2015 and 2016, respectively. The resource is available only by subscription, and it does not interoperate with open standards.

The oldest and most comprehensive digital collection of rabbinic literature is the Responsa Project at Bar-Ilan University. The project traces its origins to the 1960s and released its first version in 1992. After versions on CD-ROM and USB drive, the project is now available by subscription in a web browser. It supports browsing and searching, but lacks interoperability and other advanced features. The Soncino Classics CD-ROM includes the Hebrew/Aramaic texts and English translations of the Babylonian Talmud, Midrash Rabbah, and Zohar. The translation of the Talmuds edited by Jacob Neusner is available as a stand-alone ebook and as an addition to Logos Bible software.

The Comprehensive Aramaic Lexicon at Hebrew Union College-Jewish Institute of Religion includes three million words from across the history of the Aramaic language, with morphological parsing and lexical entries. In addition to search and browse, the interface supports “key word in context,” which shows every instance of a word in the database along with a few words before and after. Though less directly related to Early Judaism, Syriaca.org at Vanderbilt University is an excellent example of a project that utilizes open standards for linked data, and consequently interoperates with geographic tools such as Pelagios and Pleiades. Similarly, Papyri.info at Duke University exemplifies the use of open standards in aggregating information from and about papyri, though most are unrelated to Early Judaism. Coptic Scriptorium also deserves mention as an exemplar of the potential of digital tools.
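The logic of a “key word in context” display is simple enough to sketch in a few lines (with a toy English text; this illustrates the general technique, not the CAL’s actual interface or data):

```python
def kwic(tokens, keyword, window=3):
    """Yield each occurrence of keyword with `window` words on either side."""
    for i, token in enumerate(tokens):
        if token == keyword:
            before = " ".join(tokens[max(0, i - window):i])
            after = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{before:>25}  [{token}]  {after}"

text = ("in the beginning God created the heavens and the earth "
        "and the earth was without form").split()
for line in kwic(text, "earth"):
    print(line)
```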

Secondary sources search and access

Secondary literature has several characteristics that make it easier to aggregate and discover than ancient sources. Publications in recent decades are typically “born digital,” meaning they were created on computers in the first place and so do not require digitization by scanning and character recognition. (Errors still occur when a digital source is printed to paper and redigitized.) Modern publications have objective characteristics such as “author” and “date,” unlike ancient sources, which may require several paragraphs to describe the likely range of possibilities. Data about data, or metadata, can be entered, aggregated, indexed, and searched far more easily when the metadata is simple and machine readable. Standards for recording bibliographic data certainly exist, yet different interpretations can still cause a search to fail or the same work to appear twice in the results. This is especially the case for translations, multi-volume works, and works in a series within a series. For example, the series Discoveries in the Judaean Desert follows a single sequence for all volumes in the series, but additional internal numbering adds confusion. The volume scholars call “DJD 13” also includes a cave number (4), the volume number for that cave (8), and a part number (1), in addition to the overall series volume (13), with Roman numerals to add to the fun (Qumran Cave 4.VIII: Parabiblical Texts, Part 1 [DJD XIII; Oxford: Clarendon, 1994]). The combination is confusing enough for beginning scholars in Dead Sea Scrolls research. Machine learning, and librarians attempting to fit the reference to an interoperable standard, are likely to arrive at different interpretations of the standard or simply make mistakes. To the extent that modern scholarship falls neatly into the categories anticipated by metadata standards, which is largely the case, it is easy for aggregators to collect bibliographic information and make it easily searchable. The largest aggregator of catalog metadata is WorldCat, which ingests catalog information from libraries all over the world. A work is more likely to be duplicated than missing in WorldCat.
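The duplication problem can be made concrete with a short sketch. The catalog records and alias tables below are hypothetical and far smaller than anything a real aggregator would need, but they show how the same DJD volume can appear as three distinct entries until the numbering is normalized:

```python
records = [
    {"title": "Qumran Cave 4.VIII: Parabiblical Texts, Part 1",
     "series": "DJD", "volume": "XIII"},
    {"title": "Qumran Cave 4, VIII: Parabiblical Texts, Part 1",
     "series": "Discoveries in the Judaean Desert", "volume": "13"},
    {"title": "Parabiblical Texts, Part 1 (Qumran Cave 4.VIII)",
     "series": "DJD", "volume": "13"},
]

SERIES_ALIASES = {"discoveries in the judaean desert": "DJD", "djd": "DJD"}
ROMAN = {"XIII": "13"}  # toy subset of a full Roman-numeral table

def key(rec):
    """Collapse series aliases and numeral styles into one comparable key."""
    series = SERIES_ALIASES[rec["series"].lower()]
    volume = ROMAN.get(rec["volume"].upper(), rec["volume"])
    return (series, volume)

unique = {key(r): r for r in records}
print(len(records), "records,", len(unique), "distinct volume(s)")  # 3 records, 1 distinct
```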

Searching for secondary literature becomes more complicated when searching for information not included in the standard library catalog metadata. Unlike catalog data, the contents of a work are typically restricted by copyright. Google Books addresses this problem by indexing all of the content of a book even if it cannot display that content. Thus searching Google Books might indicate whether the content of a work matches search terms. Large-scale, free resources rely on simple machine learning, which may work well for specific terms but fail to distinguish a search about the Book of Job from a search for a job (employment). Many researchers prefer more focused and/or subscription-based databases that rely more on informed human interpretation. Among free bibliographic search tools related to Early Judaism, the most complete is Rambi, the Index of Articles on Jewish Studies from the National Library of Israel. More focused (but not too narrowly) on Dead Sea Scrolls research is the bibliography maintained by the Orion Center for the Study of the Dead Sea Scrolls and Associated Literature. For the proper amount of money, more often paid by libraries than individuals, subscription services maintain a more curated index, and sometimes the complete work as PDF or ebook. EBSCO Research Databases categorize scholarship under many headings, including the EBSCO Jewish Studies Source. The American Theological Library Association also maintains a Religion Database. Many libraries subscribe to several databases and make efforts to unify search and results, such that users may not need to know the databases involved on the back end.

Many researchers would like to search for secondary scholarship that deals with a particular primary source. This is sometimes easy, if the citation appears in the title, keywords, or abstract in an expected form. An index of ancient works cited in a monograph may be searchable in Google Books, but only if the search string matches exactly, with no dependence on contextual “common sense.” This situation will improve with better artificial intelligence and better tagging of metadata in machine-readable formats. If the primary source is specifically Talmudic, the Lieberman Index (subscription required) claims to index ancient and modern treatments of any given passage. Researchers may also wish to search for more recent treatments of a subject treated by an older secondary source. It is easy to trace bibliography backward in time, but harder to trace it forward. The best resource for finding newer works that cite an older source relevant to a research inquiry is Google Scholar. Links labeled “cited by” and “related articles” may aid discovery, though one should not assume that the results are exhaustive.

Researchers may also wish to know about works that have not yet, or only recently, appeared in print. Often years go by between the first presentable version of research and the final publication. As discussed below, authors have many options for making their work public other than established print publishers. Google and Google Scholar index major repositories such as Humanities Commons and Academia.edu. Researchers can also search these repositories directly or join them for notifications. Researchers may find relevant news by following the right accounts on Twitter (such as Annette Y. Reed, @annetteyreed) or blogs (such as Jim Davila’s PaleoJudaica). Resources published on the Internet may also disappear (dead links) for a variety of reasons. Google sometimes displays a recently cached version of a webpage that is currently unavailable. For older dead links, one’s best hope is the Internet Archive’s Wayback Machine. This tool allows users to go to a web address, or browse the web, as it appeared in the past.
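The Wayback Machine also exposes a simple public query interface for finding archived snapshots programmatically. A minimal sketch (Python, standard library only; the endpoint is real and documented by the Internet Archive, while the address queried is just a placeholder):

```python
import json
import urllib.parse
import urllib.request

def latest_snapshot(url):
    """Return the address of the closest archived snapshot of url, or None."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url)
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

print(latest_snapshot("example.com"))
```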

Images of manuscripts and artifacts

For many researchers the most primary of primary sources is not a modern print edition, but a digital facsimile of a manuscript or other artifact. Digital technology has already brought tremendous improvements over microfilm and photographic plates in printed editions. The cost of production and transmission is lower, and quality is typically higher. As high-quality digital scanning expanded in the 1990s and digital photography surpassed film photography in the 2000s, digital access to artifacts expanded and is continuing to expand. For some researchers, the only question is whether the object has yet been digitized and made accessible. For others, various questions determine whether the benefits of digital technology for research into ancient artifacts have already reached maturity or are just beginning to blossom.

One question is whether the information sought is easily digitized. It is easy to create a simple digital equivalent of a photograph or microfilm. Information is not so easily digitized if the markings are damaged or otherwise illegible. In the case of palimpsests (erased and overwritten manuscripts), a simple photograph may not suffice to make the erased text legible, and spectral imaging may be necessary to enhance the images. For research in Early Judaism as mediated by Early Christianity, the largest project to make palimpsests legible and available online has been the Sinai Palimpsests Project (free registration required). Artifacts can also be difficult to photograph and digitize if texture is the primary or essential conveyor of meaning. Flat (diffuse) lighting may render cuneiform tablets, stone inscriptions, coins, amulets, and the like illegible. West Semitic Research pioneered the application of dynamic relighting technology (Reflectance Transformation Imaging) to artifacts related to Early Judaism. Their InscriptiFact Digital Image Library has thousands of relightable images, with thorough catalog information for search and browse (free registration required). The Jubilees Palimpsest Project combines spectral imaging with dynamic relighting for all of Latin Moses (Latin Jubilees and the Testament of Moses), and a few other artifacts.

Another question is whether the researcher already knows the catalog information of the object sought. It is easy to find an artifact (or confirm its unavailability) if one already knows the owner, shelf mark, and folio or other designator. High-quality, sometimes spectrally enhanced, images of the Dead Sea Scrolls are available from the Leon Levy Dead Sea Scrolls Digital Library. Other images are available from the Israel Museum Digital Dead Sea Scrolls. The Aleppo Codex is available on its own site (Flash required). The Leningrad Codex is available from the Internet Archive. Similarly, Codex Sinaiticus and Codex Vaticanus can be viewed online. For lower-profile artifacts, the researcher is at the mercy of the holding institution. Some institutions, such as the Bibliothèque nationale de France, have systematic programs for digitization and follow open standards for accessibility. In all these cases, however, images of the artifacts are discoverable only if the researcher already has the catalog information. This could be gained from critical editions, secondary scholarship, or perhaps aggregators such as Trismegistos. As artifacts are increasingly annotated with machine-readable linked data, it will become increasingly effective to search for artifacts not just by owner and shelf mark, but by scribal features (support, columns, lines, hand, provenance) and the contents of the text.

Another question that will determine one’s experience of the progress already made in digital access to artifacts is what one wishes to do with the images. If one wishes only to read a text on screen, one can expect decent options for pan and zoom. If one wishes to recontextualize the image in any way, it will make a difference whether the image source complies with standards for interoperability. Many of the aforementioned sites are closed silos, and seem to wish to prevent the user from saving the image (although it is difficult to prevent a simple screen capture). Other sites favor open standards for interoperability. Most significant is the International Image Interoperability Framework (IIIF), which defines standards that allow image repositories to connect outside their silos. The IIIF Image standard allows humans and machines to specify the region and resolution of the image desired. For example, if one wished to include an image from a repository in one’s own web page, one could specify the region and size desired directly in the web address pointing to the repository. This replaces downloading the image, opening it in an editor, cropping, resizing, saving, and uploading to a new server. The IIIF Presentation standard allows collections and reconstructed codices to be built from images and annotations in various repositories. This can be thought of as a replacement for printing every page and rearranging the pages in piles on a desk, but potentially on a much larger scale. Once information and its relationship to other information become machine readable through defined standards, the possibilities for computer-assisted recontextualization of information become limitless.
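The addressing scheme of the IIIF Image standard can be illustrated with a short sketch. The URL pattern is the one the specification defines; the server and image identifier here are hypothetical:

```python
def iiif_url(server, identifier, region="full", size="max",
             rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request: {identifier}/{region}/{size}/{rotation}/{quality}.{format}"""
    return f"{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Ask the server itself for a 600-pixel-wide rendering of one region of a folio,
# instead of downloading, cropping, resizing, and re-hosting the image by hand:
print(iiif_url("https://images.example.org/iiif", "ms-1234-f42r",
               region="1000,2000,1500,800",  # x, y, width, height in pixels
               size="600,"))                 # width 600, height in proportion
# https://images.example.org/iiif/ms-1234-f42r/1000,2000,1500,800/600,/0/default.jpg
```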

Data visualization

Sometimes discovery and learning benefit from rendering data in ways other than linear strings of text. Data visualization can communicate at a glance what otherwise would have required extensive work and abstract thinking. One of the core advantages of digital processing is the ability to store and process massive quantities of data. The great pre-digital scholars were able to comprehend, retain, and notice patterns in huge amounts of literary data, but even they had their limits. Visualization tools developed in recent decades can summarize information in ways that would have been extremely time consuming or impossible in earlier generations.

For example, “word clouds” quickly visualize the words that appear most frequently in a body of text by rendering the more frequently used terms in larger letters. This can quickly convey themes and emphases in a work. One could quickly visualize the frequency of personal names that appear in a work, such as the Hebrew Bible, and compare it to the relative frequency of those names in the New Testament or Talmud. If properly coded, names could be expressed in colors for gender, ethnicity, or any other object of study. Color can be used to express any dimension in a data set using “heat maps.” Charts can express the relative frequency of a lexical variant or synonym in one corpus or period relative to others. The “key word in context” display became more popular and easier to generate with digital texts, and shows more of the context than a lexicon or concordance normally would. One can also easily create geographic maps with pins or colors representing mentions of, or more detailed information about, place names in a work. Scholars have long argued that geographic information mentioned in a work (if accurate) might indicate the provenance of its composition. Simple mapping software makes it easy to apply that line of inquiry to any text, compare it to other texts, and present arguments visually to reach a wider audience more quickly. In general, research questions that might have been intuited or manually tabulated with relatively small and well-referenced corpora such as the biblical canon can now be asked of much larger corpora, as long as they are adequately machine readable.
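A minimal sketch of the tabulation that underlies such comparisons (the two “corpora” here are toy strings standing in for complete tokenized texts):

```python
from collections import Counter

corpus_a = "Moses spoke and Aaron listened and Moses wrote".split()
corpus_b = "Moses and Elijah appeared and Moses departed".split()

names = {"Moses", "Aaron", "Elijah"}
freq_a = Counter(w for w in corpus_a if w in names)
freq_b = Counter(w for w in corpus_b if w in names)

# Relative frequency per corpus; a word cloud or heat map would render
# these same numbers as size or color.
for name in sorted(names):
    print(f"{name:8} {freq_a[name] / len(corpus_a):.2f} {freq_b[name] / len(corpus_b):.2f}")
```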

Publication and access

Digital technology has not replaced the conference paper and printed volume, but it has added substantial new options. Email might be thought of as a quicker and easier version of pre-existing media, such as mail. Other electronic media facilitate on a global scale conversations that previously could have happened only in physical proximity. Web logs (blogs) and then Twitter became an easy way to share announcements and ideas, especially in their nascent stages. Academia.edu gained popularity as a resource for authors to share their ideas and reach readers (and attracted controversy for its for-profit use of personal information). Non-profit alternatives such as Humanities Commons and institutional repositories were built to offer the same or improved capabilities for search, notification, and discussion without selling personal information. In addition to published material, such online forums can be used for conference papers, slideshows, syllabi, data sets, videos, and so on. Audio-visual materials are more common for reaching popular audiences (e.g., the Society of Biblical Literature’s Bible Odyssey project or James McGrath’s “Religion Prof” podcast), but that could easily change.

Even with some help from the Internet Archive’s Wayback Machine, it is reasonable to wonder whether digitally disseminated information and ideas will have the endurance of printed paper volumes or the parchment and papyri we study. Almost all the information we have from antiquity, we have not because it was durable but because it was copied. It was copied because it was deemed worthy of copying. To the extent that information on the Internet is deemed worthy of copying and archiving, it will be preserved more easily than its pre-digital analogs. The copying of digital information is the easy part; archiving requires attention to formats. Portable Document Format (PDF) is popular as a substitute for paper, and thus is very human readable, but less machine readable. For important works and editions, the Text Encoding Initiative (TEI) provides the archival standard for texts to be readable by machines as well as humans. If the strategy for preservation of information is to rely on copies, copyright restrictions are the chief impediment. At least in the United States, the Digital Millennium Copyright Act provides protection for providers of online services, and legal precedent allows fair use, including backup copies for personal use. The best strategy for an author who wishes her ideas to be disseminated and preserved is to give explicit permission using a standard license. Whereas a custom permission statement requires human interpretation, Creative Commons provides a set of licenses that are standard and machine readable. One should also consider whether ideas “published” on the Internet are visible to machine web crawlers such as Google and the Internet Archive or hidden behind a registration or pay “wall.” Proprietary for-profit websites do not guarantee preservation of information, especially if they cease to be profitable. Digital publication has the potential, but not the assurance, to rival and surpass pre-digital media in the sustainable preservation of and access to scholarship.
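What TEI-style machine readability means in practice can be shown with a small invented fragment (persName and placeName are genuine TEI elements; the text and structure here are purely illustrative, and real TEI documents follow the full TEI Guidelines schema):

```python
import xml.etree.ElementTree as ET

fragment = """
<div type="chapter" n="1">
  <p>And <persName>Moses</persName> went up to <placeName>Sinai</placeName>.</p>
</div>
"""

root = ET.fromstring(fragment)
# Because persons and places are tagged rather than merely typeset,
# a machine can extract them directly instead of guessing from typography:
for el in root.iter():
    if el.tag in ("persName", "placeName"):
        print(el.tag, el.text)
```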