Developer Interface#

Configuration#

Certain functions of the internetarchive library require your archive.org credentials (e.g. uploading, modifying metadata, searching). Your credentials and other configuration options can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.

The easiest way to create a config file is with the configure function:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password')

Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:

>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')

Custom config files can be specified when instantiating an ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')

Or an Item object:

>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')

IA-S3 Configuration#

Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.

They can be specified in your config file like so:

[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'

Logging Configuration#

You can specify logging levels and the location of your log file like so:

[logging]
level = INFO
file = /tmp/ia.log

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)

By default logging is turned off.

Other Configuration#

By default all requests are HTTPS. You can change this setting in your config file in the general section:

[general]
secure = False

Or, using the ArchiveSession object:

>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})

In the example above, all requests will be made via HTTP.
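The s3, logging, and general sections shown above can also be combined into a single dictionary. The following is a minimal sketch, not part of the library: build_config and session_from_keys are hypothetical helper names, and the keys are placeholders. get_session is imported lazily so the dictionary can be built and inspected on its own.

```python
def build_config(access, secret, log_file=None, log_level="INFO", secure=True):
    """Assemble a config dict with the s3, logging, and general sections.

    build_config is a hypothetical helper for illustration; it is not
    part of the internetarchive library.
    """
    config = {
        "s3": {"access": access, "secret": secret},
        "general": {"secure": secure},
    }
    # Logging is off by default, so only add the section when a file is given.
    if log_file is not None:
        config["logging"] = {"level": log_level, "file": log_file}
    return config


def session_from_keys(access, secret, **options):
    """Create an ArchiveSession from the assembled dictionary."""
    # Imported here so build_config() stays usable without the library.
    from internetarchive import get_session

    return get_session(config=build_config(access, secret, **options))
```

For example, session_from_keys('mYaccEsSkEY', 'mYs3cREtKEy', log_file='/tmp/ia.log') would pass all three sections to get_session() in one call.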

ArchiveSession Objects#

The ArchiveSession object is subclassed from requests.Session. It collects together your credentials and config.

get_session(config: Mapping | None = None, config_file: str | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None) session.ArchiveSession[source]#

Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.

Parameters
  • config – A dictionary used to configure your session.

  • config_file – A path to a config file used to configure your session.

  • debug – To be passed on to this session’s method calls.

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

Returns

A new ArchiveSession object.

Usage:

>>> from internetarchive import get_session
>>> config = {'s3': {'access': 'foo', 'secret': 'bar'}}
>>> s = get_session(config)
>>> s.access_key
'foo'

From the session object, you can access all of the functionality of the internetarchive lib:

>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'

Item Objects#

Item objects represent Internet Archive items. From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.

get_item(identifier: str, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) item.Item[source]#

Get an Item object.

Parameters
  • identifier – The globally unique Archive.org item identifier.

  • config – A dictionary used to configure your session.

  • config_file – A path to a config file used to configure your session.

  • archive_session – An ArchiveSession object can be provided via the archive_session parameter.

  • debug – To be passed on to get_session().

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

  • request_kwargs – Keyword arguments that requests.Request takes.

Returns

The Item that fits the criteria.

Usage:

>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084

Uploading#

Uploading to an item can be done using Item.upload():

>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')

Or internetarchive.upload():

>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')

The item will automatically be created if it does not exist.

Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.

Setting Remote Filenames#

Remote filenames can be defined using a dictionary:

>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})

upload(identifier: str, files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, **get_item_kwargs) list[requests.Request | requests.Response][source]#

Upload files to an item. The item will be created if it does not exist.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.

  • metadata – Metadata used to create a new item. If the item already exists, the metadata will not be updated – use modify_metadata.

  • headers – Add additional HTTP headers to the request.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • queue_derive – Set to False to prevent an item from being derived after upload.

  • verbose – Display upload progress.

  • verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.

  • checksum – Skip uploading files based on checksum.

  • delete – Delete local file after the upload has been successfully verified.

  • retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.

  • retries_sleep – Amount of time to sleep between retries.

  • debug – Set to True to print headers to stdout, and exit without sending the upload request.

  • validate_identifier – Set to True to validate the identifier before uploading the file.

  • **get_item_kwargs – Optional arguments that get_item takes.

Returns

A list of Requests if debug, else a list of Responses.
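Several of the parameters above are often used together. The following is a sketch under stated assumptions, not the library's own recipe: upload_directory and its upload_fn argument are hypothetical names, and the retry values are arbitrary. The real upload function is imported lazily so the helper can be tried with a stand-in.

```python
from pathlib import Path


def upload_directory(identifier, directory, metadata, upload_fn=None):
    """Upload every regular file in `directory` to one item.

    upload_directory and upload_fn are hypothetical names used for
    illustration; upload_fn defaults to internetarchive.upload.
    """
    if upload_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        from internetarchive import upload as upload_fn

    # Sorted so the upload order is predictable; subdirectories are skipped.
    paths = sorted(str(p) for p in Path(directory).iterdir() if p.is_file())
    if not paths:
        return []

    return upload_fn(
        identifier,
        files=paths,
        metadata=metadata,  # only used when the item is being created
        checksum=True,      # skip uploading files based on checksum
        retries=5,          # retry if S3 returns a 503 SlowDown error
        retries_sleep=10,   # amount of time to sleep between retries
    )
```

With the real upload function, the return value is a list of Responses, so a caller could check each one, for instance with all(r.status_code == 200 for r in responses).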

Metadata#

modify_metadata(identifier: str, metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, request_kwargs: Mapping | None = None, **get_item_kwargs) requests.Request | requests.Response[source]#

Modify the metadata of an existing item on Archive.org.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • metadata – Metadata used to update the item.

  • target – The metadata target to update. Defaults to metadata.

  • append – Set to True to append metadata values to current values rather than replacing. Defaults to False.

  • append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.

  • priority – Set task priority.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • debug – Set to True to return a requests.Request object instead of sending the request. Defaults to False.

  • **get_item_kwargs – Arguments that get_item takes.

Returns

A Request if debug else a Response.

The default target to write to is metadata. To write to another target, such as files, use the target parameter. For example, to add a metadata field to a file called foo.txt within an item whose identifier is my_identifier:

>>> from internetarchive import get_files, modify_metadata
>>> r = modify_metadata('my_identifier', metadata={'title': 'My File'}, target='files/foo.txt')
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'

You can also create new targets if they don’t exist:

>>> r = modify_metadata('my_identifier', metadata={'foo': 'bar'}, target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}
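The append_list and debug parameters can be combined to preview an additive change before sending it. This is a minimal sketch: add_subjects and modify_fn are hypothetical names, and the use of the subject field is an assumption chosen for illustration.

```python
def add_subjects(identifier, subjects, dry_run=False, modify_fn=None):
    """Append subject values to an item without replacing existing ones.

    add_subjects and modify_fn are hypothetical names; modify_fn defaults
    to internetarchive.modify_metadata.
    """
    if modify_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        from internetarchive import modify_metadata as modify_fn

    # Drop repeated values locally while keeping their original order.
    unique = list(dict.fromkeys(subjects))

    return modify_fn(
        identifier,
        metadata={"subject": unique},
        append_list=True,  # append to the existing multi-value field
        debug=dry_run,     # True returns a Request instead of sending it
    )
```

With dry_run=True the return value is a Request, which can be inspected before repeating the call with dry_run=False.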

Downloading#

download(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, timeout: int | float | tuple[int, float] | None = None, **get_item_kwargs) list[requests.Request | requests.Response][source]#

Download files from an item.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only download files matching the given file names.

  • formats – Only download files matching the given formats.

  • glob_pattern – Only download files matching the given glob pattern.

  • dry_run – Print URLs to files to stdout rather than downloading them.

  • verbose – Turn on verbose output.

  • ignore_existing – Skip files that already exist locally.

  • checksum – Skip downloading file based on checksum.

  • destdir – The directory to download files to.

  • no_directory – Download files to current working directory rather than creating an item directory.

  • retries – The number of times to retry on failed requests.

  • item_index – The index of the item for displaying progress in bulk downloads.

  • ignore_errors – Don’t fail if a single file fails to download, continue to download other files.

  • on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).

  • return_responses – Rather than downloading files to disk, return a list of response objects.

  • **get_item_kwargs – Optional arguments that get_item takes.

Returns

A list of Requests if debug, else a list of Responses.
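As an illustration of how the filtering and safety options fit together, the sketch below downloads only files matching a glob pattern into a chosen directory. download_xml and download_fn are hypothetical names introduced here; the option values are examples, not recommendations.

```python
def download_xml(identifier, destdir, preview=False, download_fn=None):
    """Download an item's XML files into `destdir`.

    download_xml and download_fn are hypothetical names; download_fn
    defaults to internetarchive.download.
    """
    if download_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        from internetarchive import download as download_fn

    return download_fn(
        identifier,
        glob_pattern="*xml",   # only files matching this glob pattern
        destdir=destdir,       # directory to download files to
        ignore_existing=True,  # skip files that already exist locally
        ignore_errors=True,    # keep going if a single file fails
        dry_run=preview,       # True prints URLs instead of downloading
    )
```

Calling download_xml('nasa', '/tmp/ia', preview=True) first is a cheap way to see which URLs would be fetched.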

Deleting#

delete(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, cascade_delete: bool = False, access_key: str | None = None, secret_key: str | None = None, verbose: bool = False, debug: bool = False, **kwargs) list[requests.Request | requests.Response][source]#

Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only delete files matching the given filenames.

  • formats – Only delete files matching the given formats.

  • glob_pattern – Only delete files matching the given glob pattern.

  • cascade_delete – Delete all files associated with the specified file, including upstream derivatives and the original.

  • access_key – IA-S3 access_key to use when making the given request.

  • secret_key – IA-S3 secret_key to use when making the given request.

  • verbose – Print actions to stdout.

  • debug – Set to True to print headers to stdout and exit without sending the delete request.

Returns

A list of Requests if debug, else a list of Responses.

File Objects#

get_files(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, on_the_fly: bool = False, **get_item_kwargs) list[files.File][source]#

Get File objects from an item.

Parameters
  • identifier – The globally unique Archive.org identifier for a given item.

  • files – Only return files matching the given filenames.

  • formats – Only return files matching the given formats.

  • glob_pattern – Only return files matching the given glob pattern.

  • exclude_pattern – Exclude files matching the given glob pattern.

  • on_the_fly – Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).

  • **get_item_kwargs – Arguments that get_item() takes.

Returns

Files from an item.

Usage:

>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
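File objects can also be summarized rather than just listed. The sketch below totals file sizes per format. It assumes each File object exposes format and size attributes (only name is shown in the example above), so it reads them defensively; size_by_format is a hypothetical helper name.

```python
from collections import defaultdict


def size_by_format(files):
    """Sum file sizes per format for an iterable of File-like objects.

    Assumes each object has `format` and `size` attributes; missing or
    empty values are tolerated. size_by_format is a hypothetical helper.
    """
    totals = defaultdict(int)
    for f in files:
        # A missing or None size is counted as zero bytes.
        size = getattr(f, "size", None)
        # Files without a format are grouped under "unknown".
        fmt = getattr(f, "format", None) or "unknown"
        totals[fmt] += int(size or 0)
    return dict(totals)
```

It could then be applied to the result of get_files, for example size_by_format(get_files('nasa')).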

Searching Items#

search_items(query: str, fields: Iterable | None = None, sorts=None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, archive_session: session.ArchiveSession | None = None, config: Mapping | None = None, config_file: str | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) search.Search[source]#

Search for items on Archive.org.

Parameters
  • query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.

  • fields – The metadata fields to return in the search results.

  • params – The URL parameters to send with each request sent to the Archive.org Advancedsearch API.

  • full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].

  • dsl_fts – Beta support for querying the archive.org Full Text Search API in DSL (i.e. do not prepend !L to the full_text_search query) [default: False].

  • config – Configuration options for session.

  • config_file – A path to a config file used to configure your session.

  • http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.

  • request_kwargs – Keyword arguments that requests.Request takes.

  • max_retries – The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:

    >>> from internetarchive import get_session
    >>> s = get_session()
    >>> s.mount_http_adapter()
    >>> search_results = s.search_items('nasa')

    See ArchiveSession.mount_http_adapter() for more details.

Returns

A Search object, yielding search results.
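Because the Search object yields results as it is iterated, a caller can stop early instead of consuming every match. This is a minimal sketch that assumes each result is a dictionary keyed by the requested fields; first_identifiers and search_fn are hypothetical names.

```python
from itertools import islice


def first_identifiers(query, limit=10, search_fn=None):
    """Return the identifiers of the first `limit` search results.

    Assumes each yielded result is a dict containing the requested
    fields. first_identifiers and search_fn are hypothetical names;
    search_fn defaults to internetarchive.search_items.
    """
    if search_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in.
        from internetarchive import search_items as search_fn

    results = search_fn(query, fields=["identifier"])
    # islice stops iterating once `limit` results have been taken.
    return [result["identifier"] for result in islice(results, limit)]
```

For instance, first_identifiers('collection:nasa', limit=5) would collect at most five identifiers.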

Internet Archive Tasks#

get_tasks(identifier: str = '', params: dict | None = None, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) set[catalog.CatalogTask][source]#

Get tasks from the Archive.org catalog.

Parameters
  • identifier – The Archive.org identifier for which to retrieve tasks.

  • params – The URL parameters to send with each request sent to the Archive.org catalog API.

Returns

A set of CatalogTask objects.