<!DOCTYPE html>
<!-- [if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif] -->
<!-- [if gt IE 8]><! --> < html class = "no-js" lang = "en" > <!-- <![endif] -->
< head >
< meta charset = "utf-8" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" >
< link rel = "shortcut icon" href = "../img/favicon.ico" >
< title > Concept of the Extractor - NewPipe Documentation< / title >
<!-- local fonts -->
< link rel = "stylesheet" href = "../css/local_fonts.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme_extra.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme_child.css" type = "text/css" / >
<!-- local code syntax highlighting -->
< link rel = "stylesheet" href = "../css/github.min.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/highlight.css" type = "text/css" / >
< script >
// Current page data
var mkdocs_page_name = "Concept of the Extractor";
var mkdocs_page_input_path = "01_Concept_of_the_extractor.md";
var mkdocs_page_url = null;
< / script >
< script src = "../js/jquery-2.1.1.min.js" defer > < / script >
< script src = "../js/modernizr-2.8.3.min.js" defer > < / script >
< script src = "../js/highlight.min.js" > < / script >
< script > hljs.initHighlightingOnLoad(); < / script >
< / head >
< body class = "wy-body-for-nav" role = "document" >
< div class = "wy-grid-for-nav" >
< nav data-toggle = "wy-nav-shift" class = "wy-nav-side stickynav" >
< div class = "wy-side-nav-search" >
< a href = ".." class = "icon icon-home" > NewPipe Documentation< / a >
< div role = "search" >
< form id = "rtd-search-form" class = "wy-form" action = "../search.html" method = "get" >
< input type = "text" name = "q" placeholder = "Search docs" title = "Type search term here" / >
< / form >
< / div >
< / div >
< div class = "wy-menu wy-menu-vertical" data-spy = "affix" role = "navigation" aria-label = "main navigation" >
< ul class = "current" >
< li class = "toctree-l1" >
< a class = "" href = ".." > Welcome to NewPipe< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../00_Prepare_everything/" > Before You Start< / a >
< / li >
< li class = "toctree-l1 current" >
< a class = "current" href = "./" > Concept of the Extractor< / a >
< ul class = "subnav" >
< li class = "toctree-l2" > < a href = "#concept-of-the-extractor" > Concept of the Extractor< / a > < / li >
< ul >
< li > < a class = "toctree-l3" href = "#the-collectorextractor-pattern" > The Collector/Extractor Pattern< / a > < / li >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern-for-lists" > Collector/Extractor Pattern for Lists< / a > < / li >
< li > < a class = "toctree-l3" href = "#listextractor" > ListExtractor< / a > < / li >
< / ul >
< / ul >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../02_Concept_of_LinkHandler/" > Concept of the LinkHandler< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../03_Implement_a_service/" > Implementing a Service< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../04_Run_changes_in_App/" > Testing Your Changes in the App< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../05_releasing/" > Releasing a New NewPipe Version< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../06_documentation/" > About This Documentation< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../07_maintainers_view/" > Maintainers' Section< / a >
< / li >
< / ul >
< / div >
< / nav >
< section data-toggle = "wy-nav-shift" class = "wy-nav-content-wrap" >
< nav class = "wy-nav-top" role = "navigation" aria-label = "top navigation" >
< i data-toggle = "wy-nav-top" class = "fa fa-bars" > < / i >
< a href = ".." > NewPipe Documentation< / a >
< / nav >
< div class = "wy-nav-content" >
< div class = "rst-content" >
< div role = "navigation" aria-label = "breadcrumbs navigation" >
< ul class = "wy-breadcrumbs" >
< li > < a href = ".." > Docs< / a > » < / li >
< li > Concept of the Extractor< / li >
< li class = "wy-breadcrumbs-aside" >
< / li >
< / ul >
< hr / >
< / div >
< div role = "main" >
< div class = "section" >
< h1 id = "concept-of-the-extractor" > Concept of the Extractor< / h1 >
< h2 id = "the-collectorextractor-pattern" > The Collector/Extractor Pattern< / h2 >
< p > Before you start coding your own service, you need to understand the basic concept of the extractor itself. There is a pattern
you will find all over the code, called the < strong > extractor/collector< / strong > pattern. The idea behind it is that
the < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > extractor< / a >
produces fragments of data, and the collector collects them and assembles that data into a readable format for the front end.
The collector also controls the parsing process and takes care of error handling. So, if the extractor fails at any
point, the collector decides whether or not parsing should continue. This requires the extractor to be made up of
multiple methods, one method for every data field the collector wants to have. The collectors are provided by NewPipe.
You need to take care of the extractors.< / p >
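< p > To make the division of labour more concrete, here is a minimal, self-contained sketch of the collector side. All names in it (< code > CollectorSketch< / code > , < code > SketchExtractor< / code > , < code > SketchInfo< / code > ) are hypothetical and are not NewPipe classes. Which fields count as mandatory is also an assumption made only for this illustration.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: none of these types are NewPipe classes.
class CollectorSketch {

    /** The extractor side: one method per data field, each of which may fail on its own. */
    interface SketchExtractor {
        String getName() throws Exception;        // treated as mandatory below (assumption)
        String getDescription() throws Exception; // treated as optional below (assumption)
    }

    /** The assembled result that a front end would consume. */
    static class SketchInfo {
        String name;
        String description;
        final List<Throwable> errors = new ArrayList<>();
    }

    /** The collector side: drives the extraction and decides how to react to failures. */
    static SketchInfo collect(SketchExtractor extractor) {
        SketchInfo info = new SketchInfo();
        // Mandatory field: a failure here aborts the whole extraction.
        try {
            info.name = extractor.getName();
        } catch (Exception e) {
            throw new IllegalStateException("mandatory field missing", e);
        }
        // Optional field: the failure is recorded and the extraction continues.
        try {
            info.description = extractor.getDescription();
        } catch (Exception e) {
            info.errors.add(e);
        }
        return info;
    }

    /** Builds an extractor whose description field always fails, for demonstration. */
    static SketchExtractor brokenDescriptionExtractor() {
        return new SketchExtractor() {
            @Override
            public String getName() {
                return "Some title";
            }

            @Override
            public String getDescription() throws Exception {
                throw new Exception("description not found");
            }
        };
    }
}
```

< p > In this sketch, a failing optional field only ends up in the < code > errors< / code > list, while a failing mandatory field stops the extraction. This mirrors the idea that the collector, not the extractor, decides whether parsing continues.< / p >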
< h3 id = "usage-in-the-front-end" > Usage in the Front End< / h3 >
< p > A typical call for retrieving data from a website would look like this:< / p >
< pre > < code class = "java" >Info info;
try {
    // Create a new Extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(some_meta_info);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch (Exception e) {
    // Handle errors when the collector decided to abort the extraction.
}
< / code > < / pre >
< h3 id = "typical-implementation-of-a-single-data-extractor" > Typical Implementation of a Single Data Extractor< / h3 >
< p > The typical implementation of a single data extractor, on the other hand, would look like this:< / p >
< pre > < code class = "java" >class MyExtractor extends FutureExtractor {

    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
        ...
    }

    @Override
    public void fetch() {
        // Actually fetch the page data here
    }

    @Override
    public String someDataField()
            throws ExtractionException { // The exception needs to be thrown if something failed
        // Get a piece of information and return it
    }

    ... // More data fields
}
< / code > < / pre >
< h2 id = "collectorextractor-pattern-for-lists" > Collector/Extractor Pattern for Lists< / h2 >
< p > Information can be represented as a list. In NewPipe, a list is represented by an
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html" > InfoItemsCollector< / a > .
An InfoItemsCollector collects and assembles a list of < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html" > InfoItem< / a > objects.
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-" > commit()< / a > .< / p >
< p > < img alt = "InfoItemsCollector_objectdiagram.svg" src = "../img/InfoItemsCollector_objectdiagram.svg" / > < / p >
< p > If you are implementing a list in your service, you need to implement an < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > InfoItemExtractor< / a >
that is able to retrieve data for one and only one InfoItem. This extractor is then < em > committed< / em > to the < strong > InfoItemsCollector< / strong > that can collect the type of InfoItems you want to generate.< / p >
< p > A common implementation would look like this:< / p >
< pre > < code >private SomeInfoItemCollector collectInfoItemsFromElement(Element element) {
    // Read *Some* as something like Stream or Channel,
    // e.g. StreamInfoItemsCollector and ChannelInfoItemsCollector are provided by NewPipe.
    SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());

    for (final Element li : element.children()) {
        collector.commit(new InfoItemExtractor() {
            @Override
            public String getName() throws ParsingException {
                ...
            }

            @Override
            public String getUrl() throws ParsingException {
                ...
            }
            ...
        });
    }
    return collector;
}
< / code > < / pre >
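< p > What happens to a committed extractor can be pictured roughly as follows. This is only a sketch under assumptions: < code > SketchItemsCollector< / code > , < code > SketchItemExtractor< / code > and < code > SketchItem< / code > are hypothetical stand-ins, and the rule that a failing item is skipped and recorded as an error is assumed for illustration rather than taken from the NewPipe sources.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not the NewPipe classes.
class ListCollectorSketch {

    /** Retrieves the data for exactly one item. */
    interface SketchItemExtractor {
        String getName() throws Exception;
        String getUrl() throws Exception;
    }

    /** The assembled item that ends up in the list. */
    static class SketchItem {
        final String name;
        final String url;

        SketchItem(String name, String url) {
            this.name = name;
            this.url = url;
        }
    }

    static class SketchItemsCollector {
        final List<SketchItem> items = new ArrayList<>();
        final List<Throwable> errors = new ArrayList<>();

        /** Reads every field from the extractor; a failing item is skipped, not fatal. */
        void commit(SketchItemExtractor extractor) {
            try {
                items.add(new SketchItem(extractor.getName(), extractor.getUrl()));
            } catch (Exception e) {
                errors.add(e);
            }
        }
    }

    /** Test helper: an extractor with fixed values; a null name simulates a parsing failure. */
    static SketchItemExtractor fixed(final String name, final String url) {
        return new SketchItemExtractor() {
            @Override
            public String getName() throws Exception {
                if (name == null) {
                    throw new Exception("name missing");
                }
                return name;
            }

            @Override
            public String getUrl() {
                return url;
            }
        };
    }
}
```

< p > Committing one working and one failing extractor to this sketch leaves one item in < code > items< / code > and one entry in < code > errors< / code > , so a single broken list entry does not discard the rest of the list.< / p >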
< h2 id = "listextractor" > ListExtractor< / h2 >
< p > There is more to know about lists:< / p >
< ol >
< li >
< p > When a streaming site shows a list of items, it usually offers some additional information about that list, such as its title, a thumbnail,
and its creator. This information can be called the < strong > list header< / strong > .< / p >
< / li >
< li >
< p > When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down.< / p >
< / li >
< / ol >
< p > Both of these problems are addressed by the < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html" > ListExtractor< / a > , which takes care of extracting additional metadata about the list
and of splitting lists into several pages, so-called < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > InfoItemsPage< / a > objects.
Each page has its own URL and needs to be extracted separately.< / p >
< p > For extracting list header information, a < code > ListExtractor< / code > behaves like a regular extractor. For handling < code > InfoItemsPages< / code > , it adds methods
such as:< / p >
< ul >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--" > getInitialPage()< / a > ,
which returns the first page of InfoItems.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--" > getNextPageUrl()< / a > :
if a second page of InfoItems is available, this returns the URL pointing to it.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-" > getPage()< / a > ,
which returns a ListExtractor.InfoItemsPage for a URL that was retrieved by the < code > getNextPageUrl()< / code > method of the previous page.< / li >
< / ul >
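< p > From the caller's point of view, these methods combine into a simple paging loop. The sketch below is self-contained and uses hypothetical stand-ins (< code > SketchListExtractor< / code > , < code > SketchPage< / code > ) rather than the real NewPipe classes. It also assumes that the URL of the following page is read from the page object itself, and the example URL is made up.< / p >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins: these are not the NewPipe classes, only their rough shape.
class PagingSketch {

    /** One page of items plus the URL of the following page (null if there is none). */
    static class SketchPage {
        final List<String> items;
        final String nextPageUrl;

        SketchPage(List<String> items, String nextPageUrl) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
        }
    }

    /** The two paging entry points: one for the first page, one for every later page. */
    interface SketchListExtractor {
        SketchPage getInitialPage();

        SketchPage getPage(String pageUrl);
    }

    /** Walks all pages: the first one via getInitialPage(), the rest via getPage(url). */
    static List<String> collectAllItems(SketchListExtractor extractor) {
        List<String> all = new ArrayList<>();
        SketchPage page = extractor.getInitialPage();
        all.addAll(page.items);
        // Keep following the next-page URL until a page reports that nothing follows.
        while (page.nextPageUrl != null) {
            page = extractor.getPage(page.nextPageUrl);
            all.addAll(page.items);
        }
        return all;
    }

    /** A fake two-page list used only to exercise the loop. */
    static SketchListExtractor twoPageExtractor() {
        return new SketchListExtractor() {
            @Override
            public SketchPage getInitialPage() {
                return new SketchPage(Arrays.asList("a", "b"), "https://example.com/list?page=2");
            }

            @Override
            public SketchPage getPage(String pageUrl) {
                return new SketchPage(Arrays.asList("c"), null);
            }
        };
    }
}
```

< p > A front end would normally not load every page eagerly like this. It would request the next page only when the user scrolls down or asks for more items.< / p >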
< p > The first page is handled specially because many websites, such as YouTube, load the first page of
items like a regular web page, but all the others via AJAX requests.< / p >
< p > An InfoItemsPage itself has two constructors which take these parameters:< / p >
< ul >
< li > The < strong > InfoItemsCollector< / strong > of the list that the page should represent.< / li >
< li > A < strong > nextPageUrl< / strong > which represents the URL of the following page (may be null if no page follows).< / li >
< li > Optionally, < strong > errors< / strong > , which is a list of exceptions that may have happened during extraction.< / li >
< / ul >
< p > Here is a simplified reference implementation of a list extractor that only extracts pages, but not metadata:< / p >
< pre > < code >class MyListExtractor extends ListExtractor {
    ...
    private Document document;
    ...

    public InfoItemsPage<SomeInfoItem> getPage(String pageUrl)
            throws ExtractionException {
        SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
        document = myFunctionToGetThePageHTMLWhatever(pageUrl);

        // Remember this part from the simple list extraction
        for (final Element li : document.children()) {
            collector.commit(new InfoItemExtractor() {
                @Override
                public String getName() throws ParsingException {
                    ...
                }

                @Override
                public String getUrl() throws ParsingException {
                    ...
                }
                ...
            });
        }
        return new InfoItemsPage<SomeInfoItem>(collector, myFunctionToGetTheNextPageUrl(document));
    }

    public InfoItemsPage<SomeInfoItem> getInitialPage() throws ExtractionException {
        // document was initialized by the fetch() function.
        return getPage(getTheCurrentPageUrl(document));
    }
    ...
}
< / code > < / pre >
< / div >
< / div >
< footer >
< div class = "rst-footer-buttons" role = "navigation" aria-label = "footer navigation" >
< a href = "../02_Concept_of_LinkHandler/" class = "btn btn-neutral float-right" title = "Concept of the LinkHandler" > Next < span class = "icon icon-circle-arrow-right" > < / span > < / a >
< a href = "../00_Prepare_everything/" class = "btn btn-neutral" title = "Before You Start" > < span class = "icon icon-circle-arrow-left" > < / span > Previous< / a >
< / div >
< hr / >
< div role = "contentinfo" >
<!-- Copyright etc -->
< / div >
Built with < a href = "http://www.mkdocs.org" > MkDocs< / a > using a < a href = "https://github.com/snide/sphinx_rtd_theme" > theme< / a > provided by < a href = "https://readthedocs.org" > Read the Docs< / a > .
< / footer >
< / div >
< / div >
< / section >
< / div >
< div class = "rst-versions" role = "note" style = "cursor: pointer" >
< span class = "rst-current-version" data-toggle = "rst-current-version" >
< span > < a href = "../00_Prepare_everything/" style = "color: #fcfcfc;" > « Previous< / a > < / span >
< span style = "margin-left: 15px" > < a href = "../02_Concept_of_LinkHandler/" style = "color: #fcfcfc" > Next » < / a > < / span >
< / span >
< / div >
< script > var base_url = '..'; < / script >
< script src = "../js/theme.js" defer > < / script >
< script src = "../search/main.js" defer > < / script >
< / body >
< / html >