The Afro-asiatic corpus : CorpAfroAs

Access the Corpus

  • When first opened, the CorpAfroAs corpus page displays the files of the corpus in a tree structure organized by language.
  • The files of the corpus are identified by their titles and by their filenames (in parentheses). Each item in the corpus is linked to an audio file and its corresponding ELAN annotation file, except for the PDF files, which are grammatical sketches of the languages.
  • An information button will display the OLAC metadata for each pair of related files (or PDF file).
  • At the top right of the screen, a 'Register' button lets you request a login that will give you access to the corpus. For the moment, only online queries are possible. To register, you have to accept the Copyright and citation rules and the Ethical rules.
  • To the left of the 'Register' button, a 'Connect' button lets you log in to the database to search the corpus.
  • To access the corpus, check the files you are interested in (without logging in, only a few sample files can be searched, for experimentation purposes), then click the magnifying glass icon at the top of the list, next to 'ELAN files'.

  • Once connected (login and password accepted), the annotated files you have the right to display can be shown by clicking on their identifier (e.g. ARY_AB_NARR_02).
  • The files you have the right to search are preceded by a checkbox.
  • The files whose checkboxes are checked form the domain in which the search will be carried out.
    To select or deselect a group of files belonging to the same folder, click the checkbox to the right of the folder's name.
  • Other buttons appear depending on your specific rights.
    The ELAN icon in front of each filename (for those who have the authorization) downloads the corresponding ELAN file (right-click it, then choose 'save the target as...').
  • The WAV icon downloads the corresponding audio file.
  • To search the sub-corpus defined by the checked files, click the magnifying glass icon at the top of the list, next to 'ELAN files'.

  • In order to search files, you will need the list of abbreviations used by the researcher (downloadable from here). This list is complemented by the grammatical sketch of the language (downloadable from this website), which tells you more about the labels and categories in that particular language.

    The lists and concordances form

  • The List button creates a list of the distinct words (mot), morphemes (mb), glosses (ge) or categories (rx) in the sub-corpus defined by the search domain, with their number of occurrences. Depending on the order chosen, this list is alphabetical or sorted by decreasing number of occurrences.

  • One can access the list of occurrences of an item by clicking on its value.

    From that page, it's possible to

    • display an annotated prosodic unit containing the item by clicking on its identifier. The unit can then be played.
    • select several annotation units containing the item and display them in a new page by clicking the Show selected items button.
    The Concordance button creates a list of the words matching the regular expression entered in the (Part of) word concordances box, with their left and right contexts. The matching words are centered on the page.
    The prosodic units where these words appear can be displayed (then listened to) one by one by clicking on their identifier. (Here, the concordance of the word 'o:mhi:n', then the display of the prosodic unit BEJ_MV_NARR_02_farmer_313)

    The unit display can then be extended on both sides by entering the number of units desired on the left and on the right, then clicking the Extend display button.
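
    The List and Concordance functions described above can be approximated offline with a short script. This is only a sketch under stated assumptions: the unit identifiers and words are invented sample data, and the function names frequency_list and concordance are hypothetical, not part of the CorpAfroAs interface.

```python
import re
from collections import Counter

# Hypothetical sample data: one list of word annotations per prosodic unit.
units = {
    "UNIT_001": ["the", "farmer", "saw", "the", "camel"],
    "UNIT_002": ["the", "camel", "ran"],
}

def frequency_list(units, by_count=True):
    """Return (value, occurrences) pairs, by decreasing count or alphabetically."""
    counts = Counter(word for words in units.values() for word in words)
    if by_count:
        # Decreasing number of occurrences; ties broken alphabetically.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(counts.items())

def concordance(units, pattern, width=2):
    """Return (unit id, left context, matching word, right context) rows."""
    rx = re.compile(pattern)
    rows = []
    for unit_id, words in units.items():
        for i, word in enumerate(words):
            if rx.search(word):
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                rows.append((unit_id, left, word, right))
    return rows

if __name__ == "__main__":
    for value, n in frequency_list(units):
        print(f"{value}\t{n}")
    # Center the matching word between its left and right contexts.
    for unit_id, left, word, right in concordance(units, r"^cam"):
        print(f"{unit_id}  {left:>15}  [{word}]  {right}")
```

    Each concordance row keeps the unit identifier, which plays the role of the clickable identifier that opens the prosodic unit in the interface.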

    The query form

    The query engine used for the CorpAfroAs corpus is based on the mfSearch package from the Max Planck Institute for Psycholinguistics, Nijmegen. The query interface reproduces that of the ELAN software. Two complementary functions have been added. Here is a description of the form.

    The Search window presents two areas:

  • The search domain previously defined (here 8 files). When the mouse is moved over this area, the list of files is displayed in a popup.
  • The Search form
    By default
    -> case sensitive : uppercase and lowercase are not equivalent
    -> regular expression : how the search targets and contexts must be interpreted (cf. bottom of the page)
    -> minimal duration : search only in units of at least this duration (0 = any duration)
    -> maximal duration : search only in units of at most this duration (0 = any duration)
    -> Target : searched sequence. Remember to specify, to the right of the layer, the tier (or tier type) in which you want to search for this sequence (morpheme, word, gloss or category tiers...).

    (In the screenshot above, one searches for the label 'OBL' in tiers of 'ge' type)

  • Multiple layer search

    You can refine your initial search by adding constraints in the layer below. In this case, you will have to choose the type of constraint you want to impose on the targets.

    • Fully aligned : both annotation cells must have the same temporal duration
    • Inside : the upper cell must be a child of the bottom one (like 'ge' child of 'mb')
    • Within : the upper cell must be a parent of the bottom one (like 'mot' parent of 'mb')
    • Overlap : there must be a temporal overlap between the upper cell and the bottom cell
    For example, after having found 1566 morphemes tagged as 'demonstrative' in the corpus ('DEM' in 'ge' type tiers), one would like to know how many are of 'proximal' type ('PROX' in 'rx' type tiers).
      \bDEM\b     Tier Type: rx
      fully aligned
      \bPROX\b    Tier Type: ge
    '\b' means a word boundary (cf. regular expressions below)

    (In the second layer of the previous screenshot, the regular expression '.' (meaning any sequence of characters), searched for in the 'mb' type tiers, is not actually a constraint, but a means of capturing the various values, =ha, -ti, =u... of the morphemes tagged as 'OBL', as illustrated by the hits screenshot)
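
    The four constraint types can be pictured as relations between time intervals. The sketch below is an approximation, not the engine's implementation: it assumes each annotation cell carries a start and end time in milliseconds, and it models the child/parent relations of 'Inside' and 'Within' by temporal containment only. The Cell type, the function names and the sample values are hypothetical.

```python
import re
from typing import NamedTuple

class Cell(NamedTuple):
    start: int   # start time in milliseconds (assumed representation)
    end: int     # end time in milliseconds
    value: str   # annotation text

def fully_aligned(upper, lower):
    # Both cells cover exactly the same time span.
    return upper.start == lower.start and upper.end == lower.end

def inside(upper, lower):
    # The upper cell lies within the span of the lower one (child-like).
    return lower.start <= upper.start and upper.end <= lower.end

def within(upper, lower):
    # The upper cell spans the lower one (parent-like).
    return upper.start <= lower.start and lower.end <= upper.end

def overlap(upper, lower):
    # The two cells share some stretch of time.
    return upper.start < lower.end and lower.start < upper.end

def multilayer_hits(upper_cells, lower_cells, upper_re, lower_re, constraint):
    """Pairs of cells matching both patterns and satisfying the constraint."""
    u_rx, l_rx = re.compile(upper_re), re.compile(lower_re)
    return [
        (u, l)
        for u in upper_cells if u_rx.search(u.value)
        for l in lower_cells if l_rx.search(l.value) and constraint(u, l)
    ]

# Invented sample: two tiers whose cells share the same time spans.
upper_tier = [Cell(0, 100, "DEM"), Cell(100, 250, "N"), Cell(250, 300, "DEM")]
lower_tier = [Cell(0, 100, "PROX.SG.M"), Cell(100, 250, "camel"),
              Cell(250, 300, "DIST.SG.M")]

hits = multilayer_hits(upper_tier, lower_tier,
                       r"\bDEM\b", r"\bPROX\b", fully_aligned)
print(len(hits))  # number of 'DEM' cells aligned with a 'PROX' cell
```

    Counting the hits of the second layer against the hits of the first layer alone gives the proportion asked for in the demonstrative example.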

    -> Context : It is possible to add constraints regarding the context of the target (horizontal constraints), i.e. on the left and right environment of the searched target, at a fixed distance (=x) or a limited one (<x) in number of annotations.
    For example, to find the nouns ('N' in 'rx' type tiers) directly followed by a determiner ('DET' in 'rx' type tiers), one will search:

      Target          Right Context
      \bN\b    = 1    \bDET\b          Tier Type: rx
    = 1 between Target and Right Context means 'at a distance of one annotation to the right'.
    = 0 would mean 'in the same annotation'.
    < 2 between Left Context and Target would mean 'with an annotation containing the left context sequence at a distance of zero or one annotation to the left of the target'.

    -> The Clear button clears the form content
    -> The Find button launches the search
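
    The context constraints described above (= x for a fixed distance, < x for a limited one) can be illustrated on a single tier treated as an ordered list of annotations. This is a minimal sketch under that assumption; the function name find_with_context and the sample tier are hypothetical and do not reflect how the query engine is implemented.

```python
import re

def find_with_context(annotations, target, context, op, x, side="right"):
    """Indices of annotations matching target whose context matches at the
    requested distance, counted in annotations on the given side."""
    t_rx, c_rx = re.compile(target), re.compile(context)
    if op == "=":
        distances = [x]              # exactly x annotations away
    elif op == "<":
        distances = range(0, x)      # 0, 1, ..., x - 1 annotations away
    else:
        raise ValueError("op must be '=' or '<'")
    step = 1 if side == "right" else -1
    hits = []
    for i, value in enumerate(annotations):
        if not t_rx.search(value):
            continue
        for d in distances:
            j = i + step * d
            if 0 <= j < len(annotations) and c_rx.search(annotations[j]):
                hits.append(i)
                break
    return hits

# Invented 'rx' tier: categories in their order of appearance.
rx_tier = ["DET", "N", "V", "N", "DET", "N.PL"]

# Nouns directly followed by a determiner (= 1 to the right).
print(find_with_context(rx_tier, r"\bN\b", r"\bDET\b", "=", 1))  # [3]
# Nouns with a determiner zero or one annotation to the left (< 2).
print(find_with_context(rx_tier, r"\bN\b", r"\bDET\b", "<", 2, side="left"))  # [1, 5]
```

    A distance of 0 tests the context pattern against the target annotation itself, which corresponds to 'in the same annotation'.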

    Regular expressions
    Regular expressions provide a flexible means of matching strings of text, such as 'beginning with', 'ending with', 'any from a list'... By default, the sequences in the target or context boxes are searched for anywhere inside the annotations. For example, searching for the label 'PFV' will also retrieve the annotations labelled 'IPFV', 'IPFV.3SG.F'...
    \b          marks a word boundary (beginning, end, punctuation). e.g.: \bIPF\b = only the occurrences of 'IPF' as a whole label inside an annotation (complex or not)
    ^           beginning of the text. e.g.: ^N = all annotations beginning with N; ^- = all annotations that are suffixes (in this corpus, prefixes have a hyphen to the right, suffixes a hyphen to the left)
    $           end of the text. e.g.: -$ = all annotations that are prefixes
    .           any single character (if nothing follows, it is interpreted as 'any sequence of characters')
    \. or [.]   the character '.'
    ?           the previous character or no character. e.g.: 'gr?ave' will match 'gave' and 'grave'
    [?]         the character '?'
    +           the previous character at least once. e.g.: 'me+t' will match 'met', 'meet'...
    [aeiou]     one of these vowels. e.g.: 'p[aeiou]pe' will match the words 'pape', 'pipe', 'pepe'...
    [^ptk]      any character but 'p', 't' or 'k'
    [a-h]       any letter between 'a' and 'h'
    NOT()       annotation not containing the text between the parentheses. e.g.: NOT(\.) in rx or ge = the plain annotations (not complex, i.e. without '.')
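
    Most of these operators behave the same way in Python's re module, so patterns can be tried out on a small list before being typed into the form. The sketch below assumes that overlap; NOT() is specific to the query form and is emulated here by negating the match, and the special reading of a lone '.' is not reproduced. The helper names and the sample labels are hypothetical.

```python
import re

def matches(pattern, annotations):
    """Annotations in which the pattern is found anywhere (the default mode)."""
    rx = re.compile(pattern)
    return [a for a in annotations if rx.search(a)]

def not_matching(pattern, annotations):
    """Emulation of NOT(pattern): annotations that do not contain the pattern."""
    rx = re.compile(pattern)
    return [a for a in annotations if not rx.search(a)]

# Invented sample of gloss and morpheme annotations.
labels = ["PFV", "IPFV", "IPFV.3SG.F", "N", "N.PL", "-ti", "ha-"]

print(matches("PFV", labels))        # ['PFV', 'IPFV', 'IPFV.3SG.F']
print(matches(r"\bPFV\b", labels))   # ['PFV']
print(matches(r"^N", labels))        # ['N', 'N.PL']
print(matches(r"^-", labels))        # ['-ti']   (suffixes)
print(matches(r"-$", labels))        # ['ha-']   (prefixes)
print(not_matching(r"\.", labels))   # ['PFV', 'IPFV', 'N', '-ti', 'ha-']
print(matches("gr?ave", ["gave", "grave", "give"]))  # ['gave', 'grave']
print(matches("me+t", ["mt", "met", "meet"]))        # ['met', 'meet']
```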