Poster: | Albretch | Date: | Oct 11, 2023 10:12am |
Forum: | texts | Subject: | Re: 20 most viewed texts (all-time views) for every language? ... |
As you scroll down their JavaScript-based, view-by-view cr@p, they stop serving pages at some point. Try a few languages (around 10), start scrolling down the result pages, and you will see what I mean.
IA apparently sees itself as an online library serving the needs of individual readers (as if it were a brick-and-mortar library).
Serving the needs of NLP and corpus-research folks shouldn't be that hard, but they don't seem to care. In fact, they apparently actively obfuscate the kind of access such folks need.
Poster: | MPDMedia | Date: | Oct 17, 2023 6:50am |
Forum: | texts | Subject: | Re: 20 most viewed texts (all-time views) for every language? ... |
Poster: | Jeff Kaplan | Date: | Oct 11, 2023 11:28pm |
Forum: | texts | Subject: | Re: 20 most viewed texts (all-time views) for every language? ... |
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&sort=-downloads
Poster: | Albretch | Date: | Oct 12, 2023 3:37am |
Forum: | texts | Subject: | Re: 20 most viewed texts (all-time views) for every language? ... |
Since I only need public domain texts, you mean:
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&sort=-downloads&and%5B%5D=lending%3A%22is_readable%22&and%5B%5D=mediatype%3A%22texts%22
That search reports 64,946 results, but as you scroll you reach page 201:
https://archive.org/search?query=mediatype%3Atexts+AND+language%3A%28spa+OR+Spanish%29&page=201&sort=-downloads&and%5B%5D=lending%3A%22is_readable%22&and%5B%5D=mediatype%3A%22texts%22
where you are served a page telling you:
"The search engine encountered an error, which might be related to your search query. Tips for constructing search queries (https://help.archive.org/help/search-building-powerful-complex-queries/)."
~
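For anyone who wants to reproduce this for other languages without scrolling, here is a minimal Python sketch that rebuilds the two search links above from their parameters. The function name `search_url` and its arguments are my own and purely illustrative; the parameter names and values are copied from the URLs above, and nothing here is claimed to be an official IA interface.

```python
from urllib.parse import urlencode

BASE = "https://archive.org/search"


def search_url(language_clause, page=None):
    """Build a search-page URL with the same parameters as the links above.

    language_clause is the part after "language:", e.g. "(spa OR Spanish)".
    page is optional; the first link above has no page parameter.
    """
    params = [("query", f"mediatype:texts AND language:{language_clause}")]
    if page is not None:
        params.append(("page", page))
    params += [
        ("sort", "-downloads"),
        # The two "and[]" filters are copied verbatim from the links above.
        ("and[]", 'lending:"is_readable"'),
        ("and[]", 'mediatype:"texts"'),
    ]
    # urlencode percent-escapes keys and values and turns spaces into "+".
    return f"{BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("(spa OR Spanish)", page=201))
```

Changing `page` or the language clause gives the corresponding link for any other case you want to check by hand.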
More importantly:
1) IA is "intelligently" obfuscating its pages by using some visual, JavaScript-driven, back-end-populated, view-by-view GraphQL cr@p, which is not part of any W3C standard, so you are effectively assuming that your users will click on every link of each visual thing you show them. That obfuscation is silly anyway, because anyone can use Chrome or a Chromium-based browser to save the network log as a HAR file …;
2) As if you were a brick-and-mortar library, you are assuming (and/or enforcing) that all your users come here with one book in mind. NLP and corpus-research folks need the data you keep in different ways. All such sites I know of (https://www.gutenberg.org/ebooks/offline_catalogs.html, wikipedia.org, ted.com/talks ...) offer their metadata (what you show on the "details" page for every IA "identifier") in some way, so that more technical folks can plan their research a bit better instead of having to deal with each visual thing by maddeningly clicking and clicking and clicking their way through it;
3) That SQL-like query I mentioned shouldn't be that hard to run on your back-end data; or, better yet,
4) you could offer, on a monthly basis, an updated compressed tarball of text files covering all your extant metadata, as ordered, field-separated one-liners (identifier|year|language(s)|title|size|…) (of course, excluding the data/texts themselves!), one file for each type of media (stratifying that data and making links refer to the server nearest to the user would make things more efficient for everyone);
5) you should split the files from §4, for each media type, into public-domain and copyrighted material.
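To illustrate the HAR remark in point 1: a HAR file is plain JSON whose `log.entries` list holds one object per recorded request, each with a `request.url` field. The sketch below, in Python, pulls those URLs out. The helper name `har_urls` and the sample URLs are made up for illustration; a real capture from the browser's network panel would simply be read from disk instead.

```python
import json


def har_urls(har_text, needle=""):
    """Return the request URLs recorded in a HAR document.

    har_text is the JSON text of the HAR file; needle is an optional
    substring used to keep only matching URLs (empty string keeps all).
    """
    har = json.loads(har_text)
    entries = har.get("log", {}).get("entries", [])
    urls = [entry["request"]["url"] for entry in entries]
    return [url for url in urls if needle in url]


# A tiny hand-made HAR document with made-up URLs, standing in for a real capture.
sample = json.dumps({
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://example.org/page.html"}},
            {"request": {"method": "GET", "url": "https://example.org/api/search?page=2"}},
        ]
    }
})

if __name__ == "__main__":
    for url in har_urls(sample, needle="/api/"):
        print(url)
```

With a real file you would pass `open("capture.har", encoding="utf-8").read()` (the file name is hypothetical) and filter on whatever path the back-end calls share.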
Say each line contains approximately 64 bytes; the uncompressed file would then be 3,300,000 * 64 = 211,200,000 bytes long, which would compress to approximately 70 MB. Why is this so difficult for IA to understand?
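To make the proposal in §4 and the size estimate concrete, here is a rough Python sketch of how such a catalog line could be read and how the arithmetic above works out. Everything about the format beyond the five named fields is an assumption on my part: the comma as the separator inside the language(s) field, the sample line, and the names `Record`, `parse_line` and `estimated_bytes` are all hypothetical, and fields after `size` (the "…" in §4) are simply ignored.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    identifier: str
    year: str
    languages: List[str]
    title: str
    size: int


def parse_line(line):
    """Parse one 'identifier|year|language(s)|title|size|...' catalog line.

    Assumes '|' never occurs inside a field and that several languages
    are comma-separated; both are illustrative assumptions.
    """
    identifier, year, languages, title, size = line.rstrip("\n").split("|")[:5]
    return Record(identifier, year, languages.split(","), title, int(size))


def estimated_bytes(lines, bytes_per_line=64):
    """Uncompressed size estimate: number of lines times average line length."""
    return lines * bytes_per_line


if __name__ == "__main__":
    # Made-up sample line, not real IA metadata.
    print(parse_line("example-id|1605|spa,lat|Example title|123456"))
    print(estimated_bytes(3_300_000))  # 211200000, the figure quoted above
```

Going from 211,200,000 bytes to about 70 MB assumes a compression ratio of roughly 3:1, which is only a guess until someone tries it on real metadata.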
I have been raising these points for a long time, as I have seen other people do too.
Thank you,
lbrtchx