YouTube sometimes returns videos inside channel search results. As we only want
results corresponding to the type we requested, this commit makes
YoutubeSearchExtractor ignore non-requested search results, using the first
content filter value of the extractor's LinkHandler.
Also remove an unneeded exception throwing declaration in
YoutubeSearchExtractor.
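The filtering of non-requested results described above could look roughly like
the following sketch. The Item class, its type field and the keepRequested
method are hypothetical names used for illustration only; they are not the
extractor's actual API.

```java
import java.util.ArrayList;
import java.util.List;

final class SearchResultFilterSketch {
    // Hypothetical stand-in for a search result; "type" is e.g. "channels".
    static final class Item {
        final String type;
        final String name;

        Item(final String type, final String name) {
            this.type = type;
            this.name = name;
        }
    }

    private SearchResultFilterSketch() { }

    // Keeps only the items matching the first content filter value.
    // An empty filter list means no type was requested, so all items are kept.
    static List<Item> keepRequested(final List<Item> items,
                                    final List<String> contentFilters) {
        if (contentFilters.isEmpty()) {
            return new ArrayList<>(items);
        }
        final String requestedType = contentFilters.get(0);
        final List<Item> kept = new ArrayList<>();
        for (final Item item : items) {
            if (item.type.equals(requestedType)) {
                kept.add(item);
            }
        }
        return kept;
    }
}
```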
This query parameter, whose value is set to false, was not added to two
requests made in the test classes of YoutubeMixPlaylistExtractorTest.
Also remove an unneeded ParsingException exception throwing declaration in a
test method.
This should make returned dates consistent across the time zones and countries
in which the extractor is run.
It was previously only set on YouTube Music search continuations.
For every InnerTube request:
- Always add a `request` object with the following properties:
  - "internalExperimentFlags" set to an empty array;
  - "useSsl" set to "true";
  - "lockedSafetyMode" set to "false".
- Use a proper TODO comment about providing a way to enable restricted mode on
  every request, and add it to requests on which it wasn't present.
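As an illustration, the `request` object listed above could be built as in the
sketch below. It uses a plain map instead of a JSON library, the class and
method names are hypothetical, and it assumes that the "true" and "false"
values are JSON booleans rather than strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

final class InnerTubeRequestObjectSketch {
    private InnerTubeRequestObjectSketch() { }

    // Builds the "request" object as a map; serializing it to JSON is left out.
    static Map<String, Object> buildRequestObject() {
        final Map<String, Object> request = new LinkedHashMap<>();
        request.put("internalExperimentFlags", new ArrayList<String>());
        // Assumption: booleans, not the strings "true" and "false"
        request.put("useSsl", true);
        // TODO: provide a way to enable restricted mode with this property
        request.put("lockedSafetyMode", false);
        return request;
    }
}
```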
For YouTube Music:
- Remove alt query parameter, as it is not used anymore by the website;
- Add prettyPrint query parameter with false value on YouTube Music search
continuations.
Default image qualities were only removed from image URLs with the jpg
extension, so the image suffix was appended to complete non-JPG image URLs,
which produced invalid image URLs.
Now, only the image quality name, with its leading "-" character and the "."
character after the name, is removed. It is replaced by a format placeholder,
which is in turn replaced by the image quality name of each quality.
As the image suffixes do not contain the image extension, the names of the
image quality lists have been adapted to these changes and some related
comments have also been improved.
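The replacement described above can be sketched as follows. The default quality
name ("large"), the quality list and the example host are assumptions made for
illustration; only the idea of swapping "-quality." for a "-%s." placeholder
while keeping the original extension comes from this change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ImageSuffixSketch {
    // Hypothetical default quality and quality names (no file extension).
    private static final String DEFAULT_QUALITY = "large";
    private static final List<String> QUALITY_NAMES =
            Arrays.asList("mini", "large", "original");

    private ImageSuffixSketch() { }

    // Builds one URL per quality from a URL that uses the default quality.
    static List<String> allQualityUrls(final String defaultQualityUrl) {
        // Escape literal '%' so String.format does not misread them, then
        // replace "-large." with "-%s."; the extension itself is untouched.
        final String template = defaultQualityUrl
                .replace("%", "%%")
                .replaceFirst("-" + DEFAULT_QUALITY + "\\.", "-%s.");
        final List<String> urls = new ArrayList<>();
        for (final String quality : QUALITY_NAMES) {
            urls.add(String.format(template, quality));
        }
        return urls;
    }
}
```

With this approach a PNG URL stays a PNG URL for every quality, instead of
getting a JPG suffix appended to it.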
Some services may provide different image formats using the same suffix,
without us knowing which format the service provides. Enforcing an image
extension could therefore lead to invalid image URLs, as is currently the case
for SoundCloud PNG images.
With this documentation change, it is now clear that users of this class decide
whether they want to include image extensions in the suffix. The previous
behavior described in the Javadoc was not enforced.
The signature timestamp is used as a number by HTML5 clients, so it should be
used in the same way by the extractor too instead of being a string.
As the timestamp doesn't seem to exceed 5 digits, an integer is used to store
its value.
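A minimal sketch of reading the timestamp as an integer is shown below. The
regular expression and the class and method names are assumptions made for
illustration, not necessarily what the extractor uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignatureTimestampSketch {
    // Assumed pattern: the timestamp appears as "signatureTimestamp:12345"
    // or "signatureTimestamp=12345" in the player code.
    private static final Pattern TIMESTAMP_PATTERN =
            Pattern.compile("signatureTimestamp[=:](\\d+)");

    private SignatureTimestampSketch() { }

    // Returns the signature timestamp as an int instead of a string.
    static int parseSignatureTimestamp(final String playerCode) {
        final Matcher matcher = TIMESTAMP_PATTERN.matcher(playerCode);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No signature timestamp found");
        }
        return Integer.parseInt(matcher.group(1));
    }
}
```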
This commit introduces breaking changes.
For clients, everything is managed in a new class called
YoutubeJavaScriptPlayerManager:
- caching JavaScript base player code and its extracted code (functions and
variables);
- getting player signature timestamp;
- getting deobfuscated signatures of streaming URLs;
- getting streaming URLs with a throttling parameter deobfuscated, if
applicable.
The class delegates the extraction parts to external package-private classes:
- YoutubeJavaScriptExtractor, to extract and download YouTube's JavaScript base
player code: it was already present before and has been edited, mainly to
remove the previous caching system and to make it package-private;
- YoutubeSignatureUtils, for the player signature timestamp and the signature
deobfuscation function of streaming URLs, added in a recent commit;
- YoutubeThrottlingParameterUtils, which was originally
YoutubeThrottlingDecrypter, for the deobfuscation function of the throttling
parameter of streaming URLs and for checking whether this parameter is present
in a streaming URL.
YoutubeJavaScriptPlayerManager caches and then runs the extracted code if it
has been executed successfully. The cache of deobfuscated throttling parameter
values has been kept: its size can be obtained using the
getThrottlingParametersCacheSize method, and it can be cleared independently
using the clearThrottlingParametersCache method.
If an exception which is not related to JavaScript base player code fetching
occurs during the extraction or the parsing of a function or property, it is
stored until caches are cleared. This makes subsequent failing extraction calls
of the requested function or property faster and less resource-consuming, as
the result should be the same until the base player code changes.
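The idea of storing a failure until caches are cleared can be sketched as
below. This is a simplified, hypothetical class, not the manager's real code:
it stores every RuntimeException, whereas the behavior described above excludes
failures related to fetching the base player code.

```java
import java.util.function.Supplier;

final class CachedExtractionSketch<T> {
    private final Supplier<T> extraction;
    private T value;
    private RuntimeException failure;
    private int runs;

    CachedExtractionSketch(final Supplier<T> extraction) {
        this.extraction = extraction;
    }

    // Returns the cached value, rethrows the stored failure, or runs the
    // extraction once and remembers its outcome.
    T get() {
        if (failure != null) {
            throw failure;
        }
        if (value == null) {
            runs++;
            try {
                value = extraction.get();
            } catch (final RuntimeException e) {
                failure = e;
                throw e;
            }
        }
        return value;
    }

    // Forgets both the cached value and the stored failure.
    void clear() {
        value = null;
        failure = null;
    }

    // Number of times the extraction has actually been executed.
    int runCount() {
        return runs;
    }
}
```

Repeated calls after a failure rethrow the stored exception without running
the extraction again; only clear() makes the next call retry.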
All caches can be reset using the clearAllCaches method of
YoutubeJavaScriptPlayerManager.
Classes using JavaScript base player code and utilities directly (in the code
and its tests) have also been updated in this commit.
The goal of this class is to decouple the extraction of signature timestamp and
signature deobfuscation function from YoutubeStreamExtractor.
The extraction of the signature deobfuscation function has also been adapted to
support the latest YouTube player versions.
This new class, YoutubeSignatureUtils, doesn't store anything temporary such as
a copy of the player code, which has to be passed where required. It is not
public, as it will be used by a JavaScript player manager class in the future,
in order to better handle fetching the player code, caching it and resetting
its cache.
Also remove some public modifiers of test methods, add missing Test annotations
on old JUnit 4 tests (and update them if needed), and use final in some places
where it was possible.
BandcampChannelExtractorTest.testLength has been removed as the test is always
true.
This method, testImages(Collection<Image>), will first use the default image
collection test in DefaultTests and then will check that each image URL
contains f4.bcbits.com/img and ends with .jpg or .png.
To do so, a new non-instantiable final class has been added: BandcampTestUtils.
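The URL check performed by this method could be sketched as follows. The class
and method names are hypothetical, and the default image collection test from
DefaultTests is not reproduced here.

```java
final class BandcampImageUrlCheckSketch {
    // Non-instantiable utility class, like the one described above.
    private BandcampImageUrlCheckSketch() { }

    // True if the URL contains the Bandcamp image host path and ends with one
    // of the two accepted extensions.
    static boolean isExpectedImageUrl(final String url) {
        return url.contains("f4.bcbits.com/img")
                && (url.endsWith(".jpg") || url.endsWith(".png"));
    }
}
```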
Also remove some public modifiers of test methods, add missing Test annotations
on old JUnit 4 tests (and update them if needed), and use final in some places
where it was possible.
This method, testImages(Collection<Image>), will first use the default image
collection test in DefaultTests and then will check that each image URL
contains the string yt.
The Javadoc of the class has also been updated to reflect the changes made in
it (it is now more general).
Two new methods have been added in ExtractorAsserts to check that a collection
is not empty:
- assertNotEmpty(String, Collection<?>), checking:
  - that the collection is not null;
  - that it is not empty (if that's not the case, an exception will be thrown
    using the provided message).
- assertNotEmpty(Collection<?>), calling assertNotEmpty(String, Collection<?>)
  with null as the value of the string argument.
A new method has also been added to this assertion class to check the
contrary: assertEmpty(Collection<?>), checking emptiness of the collection only
if it is not null.
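The behavior of these three assertions can be sketched in plain Java as
follows. This is an illustration that throws AssertionError directly instead of
relying on a test framework, so it is not the actual ExtractorAsserts code.

```java
import java.util.Collection;

final class CollectionAssertsSketch {
    private CollectionAssertsSketch() { }

    // Fails if the collection is null, or if it is empty (using the message).
    static void assertNotEmpty(final String message,
                               final Collection<?> collection) {
        if (collection == null) {
            throw new AssertionError("The collection is null");
        }
        if (collection.isEmpty()) {
            throw new AssertionError(message);
        }
    }

    // Same check, with null as the message.
    static void assertNotEmpty(final Collection<?> collection) {
        assertNotEmpty(null, collection);
    }

    // A null collection is accepted; a non-null one must be empty.
    static void assertEmpty(final Collection<?> collection) {
        if (collection != null && !collection.isEmpty()) {
            throw new AssertionError("The collection is not empty");
        }
    }
}
```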
Three new methods have been added in ExtractorAsserts as utility test methods
for image collections:
- assertContainsImageUrlInImageCollection(String, Collection<Image>), checking
  that:
  - the provided URL and image collection are not null;
  - the image collection contains at least one image whose URL property is
    equal to the provided string.
- assertContainsOnlyEquivalentImages(Collection<Image>, Collection<Image>),
  checking that:
  - both collections are not null;
  - they have the same size;
  - each image of the first collection has its equivalent in the second one.
    This means that the properties of an image in the first collection must be
    equal to those of an image in the second one.
- assertNotOnlyContainsEquivalentImages(Collection<Image>, Collection<Image>),
  checking that:
  - both collections are not null;
  - one of the following conditions is met:
    - they have different sizes;
    - an image of the first collection has no equivalent in the second one.
      This means that the properties of an image in the first collection are
      not equal to those of any image in the second one.
These methods will be used by service extractor tests (and default ones) to
test image collections.
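The equivalence check behind the last two methods can be sketched as below. The
Image class here is a hypothetical stand-in with only a URL, a height and a
width; the real class may have more properties to compare.

```java
import java.util.Collection;
import java.util.Objects;

final class ImageEquivalenceSketch {
    // Hypothetical, simplified image type used only for this sketch.
    static final class Image {
        final String url;
        final int height;
        final int width;

        Image(final String url, final int height, final int width) {
            this.url = url;
            this.height = height;
            this.width = width;
        }

        // Two images are equivalent when all their properties are equal.
        boolean isEquivalentTo(final Image other) {
            return url.equals(other.url)
                    && height == other.height
                    && width == other.width;
        }
    }

    private ImageEquivalenceSketch() { }

    // True if both collections have the same size and each image of the first
    // one has an equivalent in the second one.
    static boolean containsOnlyEquivalentImages(final Collection<Image> first,
                                                final Collection<Image> second) {
        Objects.requireNonNull(first);
        Objects.requireNonNull(second);
        if (first.size() != second.size()) {
            return false;
        }
        return first.stream()
                .allMatch(image -> second.stream().anyMatch(image::isEquivalentTo));
    }
}
```

The "not only equivalent" assertion corresponds to this method returning false
for two non-null collections.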
This new method, defaultTestImageList(List<Image>), will check that the image
list is not null.
For each image, it will test that its URL is secure and that its height and
width are greater than or equal to their relevant unknown constants in the
Image class (HEIGHT_UNKNOWN and WIDTH_UNKNOWN).
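A rough sketch of these checks is given below. It assumes that "secure" means
an HTTPS URL and that both unknown constants are -1; the Image class is a
simplified, hypothetical stand-in for the real one.

```java
import java.util.List;

final class DefaultImageListCheckSketch {
    // Assumed values: the text above only names these constants.
    static final int HEIGHT_UNKNOWN = -1;
    static final int WIDTH_UNKNOWN = -1;

    // Hypothetical, simplified image type used only for this sketch.
    static final class Image {
        final String url;
        final int height;
        final int width;

        Image(final String url, final int height, final int width) {
            this.url = url;
            this.height = height;
            this.width = width;
        }
    }

    private DefaultImageListCheckSketch() { }

    // Fails if the list is null or if any image breaks one of the checks.
    static void checkImageList(final List<Image> images) {
        if (images == null) {
            throw new AssertionError("The image list is null");
        }
        for (final Image image : images) {
            // Assumption: a secure URL is one using the HTTPS scheme
            if (!image.url.startsWith("https://")) {
                throw new AssertionError("URL is not secure: " + image.url);
            }
            if (image.height < HEIGHT_UNKNOWN) {
                throw new AssertionError("Invalid height: " + image.height);
            }
            if (image.width < WIDTH_UNKNOWN) {
                throw new AssertionError("Invalid width: " + image.width);
            }
        }
    }
}
```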