# Concept of the Extractor
## The Collector/Extractor Pattern
Before you start coding your own service, you need to understand the basic concept of the extractor itself. There is a pattern
you will find all over the code, called the __extractor/collector__ pattern. The idea behind it is that
the [extractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html)
would produce fragments of data, and the collector would collect them and assemble that data into a readable format for the front end.
The collector also controls the parsing process, and takes care of error handling. So, if the extractor fails at any
point, the collector will decide whether or not it should continue parsing. This requires the extractor to be made out of
multiple methods, one method for every data field the collector wants to have. The collectors are provided by NewPipe.
You need to take care of the extractors.
### Usage in the Front End
A typical call for retrieving data from a website would look like this:
```java
Info info;
try {
    // Create a new extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(someMetaInfo);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch (Exception e) {
    // Handle errors in case the collector decided to abort the extraction.
}
```
### Typical Implementation of a Single Data Extractor
The typical implementation of a single data extractor, on the other hand, would look like this:
```java
class MyExtractor extends FutureExtractor {

    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
        ...
    }

    @Override
    public void fetch() {
        // Actually fetch the page data here.
    }

    @Override
    public String someDataField()
            throws ExtractionException { // The exception needs to be thrown if something fails.
        // Get the piece of information and return it.
    }

    ... // More data fields
}
```
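The collector side is provided by NewPipe, so you do not write it yourself. Still, a hypothetical sketch may help to see the division of labour. `MyInfoCollector`, `MyInfo`, `setSomeDataField()` and `addError()` are illustrative assumptions, not NewPipe's actual API; only `MyExtractor` and `ExtractionException` come from above:

```java
// Hypothetical sketch of the collector side of the pattern.
// The names here are illustrative, not NewPipe's actual implementation.
class MyInfoCollector {
    MyInfo collectFrom(MyExtractor extractor) {
        MyInfo info = new MyInfo();
        try {
            info.setSomeDataField(extractor.someDataField());
        } catch (ExtractionException e) {
            // This is where the collector decides whether to continue parsing:
            // here it records the error and goes on with the remaining fields.
            info.addError(e);
        }
        // ... collect the other data fields the same way.
        return info;
    }
}
```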
## Collector/Extractor Pattern for Lists
Information can be represented as a list. In NewPipe, a list is represented by an
[InfoItemsCollector](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html).
An InfoItemsCollector will collect and assemble a list of [InfoItem](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html)s.
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via [commit()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-).

![InfoItemsCollector_objectdiagram.svg](img/InfoItemsCollector_objectdiagram.svg)

If you are implementing a list in your service, you need to implement an [InfoItemExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html)
that is able to retrieve data for one and only one InfoItem. This extractor will then be _committed_ to the __InfoItemsCollector__ that can collect the type of InfoItems you want to generate.
A common implementation would look like this:
```java
private SomeInfoItemCollector collectInfoItemsFromElement(Element e) {
    // Read *Some* as a placeholder for something like Stream or Channel;
    // e.g. StreamInfoItemsCollector and ChannelInfoItemsCollector are provided by NewPipe.
    SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());

    for (final Element li : e.children()) {
        collector.commit(new InfoItemExtractor() {
            @Override
            public String getName() throws ParsingException {
                ...
            }

            @Override
            public String getUrl() throws ParsingException {
                ...
            }

            ...
        });
    }
    return collector;
}
```
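Once every item extractor has been committed, the front end can obtain the assembled list from the collector. A minimal sketch, assuming the `collectInfoItemsFromElement()` helper from above and the `getItems()`/`getErrors()` accessors the collector exposes:

```java
SomeInfoItemCollector collector = collectInfoItemsFromElement(element);
// One InfoItem is assembled per committed extractor.
List<InfoItem> items = collector.getItems();
// Errors the collector decided to tolerate are available separately.
List<Throwable> errors = collector.getErrors();
```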
## ListExtractor
There is more to know about lists:
1. When a streaming site shows a list of items, it usually offers some additional information about that list, like its title, a thumbnail,
and its creator. Such info can be called the __list header__.
2. When a website shows a long list of items, it usually does not load the whole list, but only a part of it. In order to get more items, you may have to click a next page button or scroll down.

Both of these problems are solved by the [ListExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html), which takes care of extracting additional metadata about the list,
and of chopping lists down into several pages, so-called [InfoItemsPage](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html)s.
Each page has its own URL and needs to be extracted separately.

For extracting list header information, a `ListExtractor` behaves like a regular extractor. For handling `InfoItemsPage`s, it adds methods
such as:
- [getInitialPage()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--),
which will return the first page of InfoItems.
- [getNextPageUrl()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--),
which will return the URL pointing to the next page of InfoItems, if such a page is available.
- [getPage()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-),
which returns a ListExtractor.InfoItemsPage by its URL, which was retrieved by the `getNextPageUrl()` method of the previous page.
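
Together, these methods let the front end walk through a list page by page. Here is a minimal consumption sketch, assuming an already fetched `ListExtractor` instance named `extractor`; `SomeInfoItem` and `handleItems()` are placeholders, as in the other examples:

```java
// A rough sketch: walk through all pages of a list, starting with the
// initial page. handleItems() stands for whatever the front end does
// with the extracted items.
ListExtractor.InfoItemsPage<SomeInfoItem> page = extractor.getInitialPage();
handleItems(page.getItems());

String nextPageUrl = extractor.getNextPageUrl();
while (nextPageUrl != null && !nextPageUrl.isEmpty()) {
    page = extractor.getPage(nextPageUrl);
    handleItems(page.getItems());
    nextPageUrl = page.getNextPageUrl();
}
```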
The reason why the first page is handled specially is that many websites, such as YouTube, load the first page of
items like a regular web page, but all the others as an AJAX request.

An InfoItemsPage itself has two constructors, which take these parameters:

- The __InfoItemsCollector__ of the list that the page should represent.
- A __nextPageUrl__, which represents the URL of the following page (may be null if no page follows).
- Optionally __errors__, which is a list of Exceptions that may have happened during extraction.

Here is a simplified reference implementation of a list extractor that only extracts pages, but not metadata:
```java
class MyListExtractor extends ListExtractor {
    ...
    private Document document;
    ...

    public InfoItemsPage<SomeInfoItem> getPage(String pageUrl)
            throws ExtractionException {
        SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
        document = myFunctionToGetThePageHTMLWhatever(pageUrl);

        // Remember this part from the simple list extraction.
        for (final Element li : document.children()) {
            collector.commit(new InfoItemExtractor() {
                @Override
                public String getName() throws ParsingException {
                    ...
                }

                @Override
                public String getUrl() throws ParsingException {
                    ...
                }

                ...
            });
        }
        return new InfoItemsPage<>(collector, myFunctionToGetTheNextPageUrl(document));
    }

    public InfoItemsPage<SomeInfoItem> getInitialPage() {
        // The document was already initialized by the fetch() method here.
        return getPage(getTheCurrentPageUrl(document));
    }
    ...
}
```