

Poster: Albretch Date: Oct 11, 2023 10:12am
Forum: texts Subject: Re: 20 most viewed texts (all-time views) for every language? ...

Because by following your recommendations you won't get (at least I am not getting) the result set that I need. Have you tried what you just mentioned yourself?
As you scroll down their JavaScript-based, view-by-view cr@p, they stop serving pages at some point. Try a few languages (around 10), start scrolling down their pages, and you will see what I mean.
IA apparently sees itself as an online library serving the needs of individuals (as if it were a brick-and-mortar library).
Serving the needs of NLP and corpus-research folks shouldn't be that hard, but they don't seem to care. In fact, they apparently actively obfuscate the kind of access such folks need.

Reply

Poster: MPDMedia Date: Oct 17, 2023 6:50am
Forum: texts Subject: Re: 20 most viewed texts (all-time views) for every language? ...

Speaking as a longtime member of Archive, I think you're asking way too much of this organization. They have millions and millions of files and pages of data to contend with and a fairly small staff to deal with it on what sounds like a tight budget. If you want what you need, perhaps you should volunteer your spare time to help solve issues instead of kvetching about them, or perhaps donate some cash for them to hire more staff. I can't afford it, but maybe you can, or know people who have that kind of income.

Reply

Poster: Jeff Kaplan Date: Oct 11, 2023 11:28pm
Forum: texts Subject: Re: 20 most viewed texts (all-time views) for every language? ...

An example for Spanish:
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&sort=-downloads
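For reference, the same top-20 ranking can also be pulled without scrolling the UI, via the JSON output of the search API; a minimal Python sketch, assuming advancedsearch.php still accepts the usual q / fl[] / sort[] / rows / output parameters:

import requests

# Minimal sketch: top 20 most-downloaded Spanish texts via the JSON search API.
# Assumes advancedsearch.php accepts q / fl[] / sort[] / rows / output as documented.
params = {
    "q": "mediatype:texts AND language:(spa OR Spanish)",
    "fl[]": ["identifier", "title", "downloads"],
    "sort[]": "downloads desc",
    "rows": 20,
    "page": 1,
    "output": "json",
}
r = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=60)
for doc in r.json()["response"]["docs"]:
    print(doc.get("downloads"), doc.get("identifier"), doc.get("title"))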

Reply

Poster: Albretch Date: Oct 12, 2023 3:37am
Forum: texts Subject: Re: 20 most viewed texts (all-time views) for every language? ...

I meant to use that query with more than one language. Where is a link to the list of all languages and their acronyms that IA uses?
Since I only need public domain texts, you mean:
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&sort=-downloads&and%5B%5D=lending%3A%22is_readable%22&and%5B%5D=mediatype%3A%22texts%22
then you see 64,946 results; but, as you scroll, on page 201:
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&page=201&sort=-downloads&and%5B%5D=lending%3A%22is_readable%22&and%5B%5D=mediatype%3A%22texts%22
you will be served a page telling you:
"The search engine encountered an error, which might be related to your search query. Tips for constructing search queries (https://help.archive.org/help/search-building-powerful-complex-queries/)."
~
More importantly:
1) IA is "intelligently" obfuscating its pages with a JavaScript back-end that populates results view by view over GraphQL, which is not part of any W3C standard, effectively assuming that your users will click on every link of every visual thing you show them. That obfuscation is silly anyway, because anyone can use Chrome or a Chromium-based browser to capture the network logs as a HAR file …;
2) As if you were a brick-and-mortar library, you assume (and/or enforce) that all your users come here with one book in mind. NLP and corpus-research folks need the data you keep in different ways. All comparable sites I know of (https://www.gutenberg.org/ebooks/offline_catalogs.html, wikipedia.org, ted.com/talks ...) offer their metadata (what you show on the "details" page for every IA "identifier") in some form that lets more technical folks strategize their research a bit better, instead of having to absurdly deal with each visual thing by maddeningly clicking and clicking and clicking ... their way through it;
3) That SQL-like query I mentioned shouldn't be that hard to run against your back-end data; or, better yet,
4) you could offer, on a monthly basis, an updated compressed tarball of text files holding all your extant metadata as ordered, field-separated one-liners (identifier|year|language(s)|title|size|…) for each type of media (of course, excluding the data/texts themselves!); stratifying that data and pointing links at the server nearest each user would make things more efficient for everyone (see the sketch below);
5) you should split the files in §4, for each media type, into public-domain and copyrighted material.
Say each line takes approximately 64 bytes; the uncompressed file would then be 3,300,000 × 64 = 211,200,000 bytes, which would compress to roughly 70 MB. Why is this so difficult for IA to understand?
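To make §4 concrete, such a line could be assembled item by item from the per-identifier metadata endpoint; a Python sketch only, where the field names "year", "language" and "item_size", and the example identifier, are assumptions about what https://archive.org/metadata/{identifier} returns:

import requests

# Sketch of the one-liner format proposed in item 4:
#   identifier|year|language(s)|title|size
# "year", "language" and "item_size" are assumed field names in the
# /metadata/{identifier} response; the identifier below is hypothetical.
def metadata_line(identifier):
    data = requests.get(f"https://archive.org/metadata/{identifier}", timeout=60).json()
    md = data.get("metadata", {})
    langs = md.get("language", "")
    if isinstance(langs, list):
        langs = ",".join(langs)
    return "|".join([
        md.get("identifier", identifier),
        str(md.get("year", "")),
        langs,
        str(md.get("title", "")).replace("|", "/"),
        str(data.get("item_size", "")),
    ])

print(metadata_line("donquijote00cerv"))  # hypothetical identifier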
I have been raising these points for a long time, as I have seen other people do as well.
Thank you,
lbrtchx