<!DOCTYPE html>
<!-- [if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif] -->
<!-- [if gt IE 8]><! --> < html class = "no-js" lang = "en" > <!-- <![endif] -->
< head >
< meta charset = "utf-8" >
< meta http-equiv = "X-UA-Compatible" content = "IE=edge" >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0" >
< link rel = "shortcut icon" href = "../img/favicon.ico" >
< title > Concept of the Extractor - NewPipe Documentation< / title >
<!-- local fonts -->
< link rel = "stylesheet" href = "../css/local_fonts.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/theme_extra.css" type = "text/css" / >
<!-- local code syntax highlighting -->
< link rel = "stylesheet" href = "../css/github.min.css" type = "text/css" / >
< link rel = "stylesheet" href = "../css/highlight.css" type = "text/css" / >
< script >
// Current page data
var mkdocs_page_name = "Concept of the Extractor";
var mkdocs_page_input_path = "01_Concept_of_the_extractor.md";
var mkdocs_page_url = null;
< / script >
< script src = "../js/jquery-2.1.1.min.js" defer > < / script >
< script src = "../js/modernizr-2.8.3.min.js" defer > < / script >
< script src = "../js/highlight.min.js" > < / script >
< script > hljs . initHighlightingOnLoad ( ) ; < / script >
< / head >
< body class = "wy-body-for-nav" role = "document" >
< div class = "wy-grid-for-nav" >
< nav data-toggle = "wy-nav-shift" class = "wy-nav-side stickynav" >
< div class = "wy-side-nav-search" >
< a href = ".." class = "icon icon-home" > NewPipe Documentation< / a >
< div role = "search" >
< form id = "rtd-search-form" class = "wy-form" action = "../search.html" method = "get" >
< input type = "text" name = "q" placeholder = "Search docs" title = "Type search term here" / >
< / form >
< / div >
< / div >
< div class = "wy-menu wy-menu-vertical" data-spy = "affix" role = "navigation" aria-label = "main navigation" >
< ul class = "current" >
< li class = "toctree-l1" >
< a class = "" href = ".." > Welcome to NewPipe.< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../00_Prepare_everything/" > Before You Start< / a >
< / li >
< li class = "toctree-l1 current" >
< a class = "current" href = "./" > Concept of the Extractor< / a >
< ul class = "subnav" >
< li class = "toctree-l2" > < a href = "#concept-of-the-extractor" > Concept of the Extractor< / a > < / li >
< ul >
< li > < a class = "toctree-l3" href = "#the-collectorextractor-pattern" > The Collector/Extractor Pattern< / a > < / li >
< li > < a class = "toctree-l3" href = "#collectorextractor-pattern-for-lists" > Collector/Extractor Pattern for Lists< / a > < / li >
< li > < a class = "toctree-l3" href = "#infoitems-encapsulated-in-pages" > InfoItems Encapsulated in Pages< / a > < / li >
< / ul >
< / ul >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../02_Concept_of_LinkHandler/" > Concept of the LinkHandler< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../03_Implement_a_service/" > Implementing a Service< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../04_Run_changes_in_App/" > Testing Your Changes in the App< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../05_releasing/" > Releasing a New NewPipe Version< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../06_documentation/" > About This Documentation< / a >
< / li >
< li class = "toctree-l1" >
< a class = "" href = "../maintainers_view_07/" > Maintainers View< / a >
< / li >
< / ul >
< / div >
< / nav >
< section data-toggle = "wy-nav-shift" class = "wy-nav-content-wrap" >
< nav class = "wy-nav-top" role = "navigation" aria-label = "top navigation" >
< i data-toggle = "wy-nav-top" class = "fa fa-bars" > < / i >
< a href = ".." > NewPipe Documentation< / a >
< / nav >
< div class = "wy-nav-content" >
< div class = "rst-content" >
< div role = "navigation" aria-label = "breadcrumbs navigation" >
< ul class = "wy-breadcrumbs" >
< li > < a href = ".." > Docs< / a > » < / li >
< li > Concept of the Extractor< / li >
< li class = "wy-breadcrumbs-aside" >
< / li >
< / ul >
< hr / >
< / div >
< div role = "main" >
< div class = "section" >
< h1 id = "concept-of-the-extractor" > Concept of the Extractor< / h1 >
< h2 id = "the-collectorextractor-pattern" > The Collector/Extractor Pattern< / h2 >
< p > Before you start coding your own service, you need to understand the basic concept of the extractor itself. There is a pattern
you will find all over the code, called the < strong > extractor/collector< / strong > pattern. The idea behind it is that
the < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > extractor< / a >
produces fragments of data, and the collector gathers them and assembles them into a format the front end can read.
The collector also controls the parsing process and takes care of error handling: if the extractor fails at any
point, the collector decides whether or not parsing should continue. This requires the extractor to consist of
multiple methods, one for every data field the collector wants to have. The collectors are provided by NewPipe.
You need to take care of the extractors.< / p >
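< p > The division of labour described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the NewPipe implementation: < code > VideoExtractor< / code > , < code > VideoInfo< / code > , < code > collect()< / code > and the local < code > ExtractionException< / code > are hypothetical stand-ins, and treating the name as mandatory and the description as optional is an assumption made for the example.< / p >

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the extractor/collector pattern; none of these
// types are the real NewPipe classes.
final class CollectorSketch {

    /** Stand-in for the exception an extractor method throws on failure. */
    static class ExtractionException extends Exception {
        ExtractionException(String message) {
            super(message);
        }
    }

    /** The extractor: one method per data field the collector asks for. */
    interface VideoExtractor {
        String getName() throws ExtractionException;

        String getDescription() throws ExtractionException;
    }

    /** The assembled result that would be handed to the front end. */
    static final class VideoInfo {
        String name;
        String description = "";
        final List<Throwable> errors = new ArrayList<>();
    }

    /** The collector: drives the extraction and decides which failures are fatal. */
    static VideoInfo collect(VideoExtractor extractor) throws ExtractionException {
        VideoInfo info = new VideoInfo();
        // Assumed mandatory field: a failure here propagates and aborts the extraction.
        info.name = extractor.getName();
        try {
            // Assumed optional field: a failure is recorded and parsing continues.
            info.description = extractor.getDescription();
        } catch (ExtractionException e) {
            info.errors.add(e);
        }
        return info;
    }

    /** Runs the collector against a fake extractor whose description lookup fails. */
    static String demo() {
        VideoExtractor fake = new VideoExtractor() {
            @Override
            public String getName() {
                return "Some video";
            }

            @Override
            public String getDescription() throws ExtractionException {
                throw new ExtractionException("description not found");
            }
        };
        try {
            VideoInfo info = collect(fake);
            return info.name + " / errors: " + info.errors.size();
        } catch (ExtractionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

< p > Because every field has its own method, the collector can tell exactly which piece of data failed and react to each failure individually.< / p >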
< h3 id = "usage-in-the-front-end" > Usage in the Front End< / h3 >
< p > A typical call for retrieving data from a website would look like this:< / p >
< pre > < code class = "java" > Info info;
try {
    // Create a new Extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(some_meta_info);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch (Exception e) {
    // Handle errors when the collector decided to abort the extraction.
}
< / code > < / pre >
< h3 id = "typical-implementation-of-a-single-data-extractor" > Typical Implementation of a Single Data Extractor< / h3 >
< p > The typical implementation of a single data extractor, on the other hand, would look like this:< / p >
< pre > < code class = "java" > class MyExtractor extends FutureExtractor {

    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
        ...
    }

    @Override
    public void fetch() {
        // Actually fetch the page data here.
    }

    @Override
    public String someDataField()
            throws ExtractionException { // The exception needs to be thrown if something failed.
        // Get the piece of information and return it.
    }

    ... // More data fields
}
< / code > < / pre >
< h2 id = "collectorextractor-pattern-for-lists" > Collector/Extractor Pattern for Lists< / h2 >
< p > Information can be represented as a list. In NewPipe, a list is represented by an
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html" > InfoItemsCollector< / a > .
An InfoItemsCollector collects and assembles a list of < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html" > InfoItem< / a > s.
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-" > commit()< / a > .< / p >
< p > < img alt = "InfoItemsCollector_objectdiagram.svg" src = "../img/InfoItemsCollector_objectdiagram.svg" / > < / p >
< p > If you are implementing a list for your service, you need to extend InfoItem, which holds the extracted information,
and implement an < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html" > InfoItemExtractor< / a >
that returns the data of one InfoItem.< / p >
< p > A common implementation would look like this:< / p >
< pre > < code > private MyInfoItemCollector collectInfoItemsFromElement(Element element) {
    MyInfoItemCollector collector = new MyInfoItemCollector(getServiceId());

    for (final Element li : element.children()) {
        // One anonymous extractor per list entry; the collector calls its methods.
        collector.commit(new InfoItemExtractor() {
            @Override
            public String getName() throws ParsingException {
                ...
            }

            @Override
            public String getUrl() throws ParsingException {
                ...
            }
            ...
        });
    }
    return collector;
}
< / code > < / pre >
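< p > To make the role of < code > commit()< / code > more concrete, here is a minimal, self-contained sketch of what a list collector might do with each committed extractor. All types in it ( < code > ItemsCollector< / code > , < code > ItemExtractor< / code > , < code > Item< / code > and the local < code > ParsingException< / code > ) are hypothetical simplifications, and skipping a broken entry while recording its error is an assumption about reasonable behaviour, not a statement about the actual InfoItemsCollector.< / p >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the collector/extractor pattern for lists.
final class ListCollectorSketch {

    /** Stand-in for the exception an item extractor throws on failure. */
    static class ParsingException extends Exception {
        ParsingException(String message) {
            super(message);
        }
    }

    /** Extractor for a single list entry: one method per data field. */
    interface ItemExtractor {
        String getName() throws ParsingException;

        String getUrl() throws ParsingException;
    }

    /** The assembled list entry. */
    static final class Item {
        final String name;
        final String url;

        Item(String name, String url) {
            this.name = name;
            this.url = url;
        }
    }

    /** Collects items; a failing entry is skipped and its error is kept. */
    static final class ItemsCollector {
        private final List<Item> items = new ArrayList<>();
        private final List<Throwable> errors = new ArrayList<>();

        void commit(ItemExtractor extractor) {
            try {
                items.add(new Item(extractor.getName(), extractor.getUrl()));
            } catch (ParsingException e) {
                errors.add(e);
            }
        }

        List<Item> getItems() {
            return items;
        }

        List<Throwable> getErrors() {
            return errors;
        }
    }

    /** Creates one extractor per raw row ({name, url}) and commits it. */
    static ItemsCollector collectFrom(List<String[]> rows) {
        ItemsCollector collector = new ItemsCollector();
        for (final String[] row : rows) {
            collector.commit(new ItemExtractor() {
                @Override
                public String getName() throws ParsingException {
                    if (row[0].isEmpty()) {
                        throw new ParsingException("missing name");
                    }
                    return row[0];
                }

                @Override
                public String getUrl() {
                    return row[1];
                }
            });
        }
        return collector;
    }

    /** Feeds one valid and one broken row through the collector. */
    static String demo() {
        ItemsCollector collector = collectFrom(Arrays.asList(
                new String[] {"First", "https://example.com/1"},
                new String[] {"", "https://example.com/2"}));
        return "items: " + collector.getItems().size()
                + ", errors: " + collector.getErrors().size();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```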
< h2 id = "infoitems-encapsulated-in-pages" > InfoItems Encapsulated in Pages< / h2 >
< p > When a streaming site shows a list of items, it usually offers some additional information about that list, such as its title, a thumbnail,
and its creator. This information is called the < strong > list header< / strong > .< / p >
< p > When a website shows a long list of items, it usually does not load the whole list, but only a part of it. To get more items, you may have to click a "next page" button or scroll down.< / p >
< p > This is why lists in NewPipe are chopped into smaller lists called < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > InfoItemsPage< / a > s. Each page has its own URL and needs to be extracted separately.< / p >
< p > Additional metadata about the list and the extraction of multiple pages are handled by a
< a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html" > ListExtractor< / a >
and its < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html" > ListExtractor.InfoItemsPage< / a > .< / p >
< p > For extracting list header information it behaves like a regular extractor. For handling < code > InfoItemsPages< / code > it adds methods
such as:< / p >
< ul >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--" > getInitialPage()< / a >
returns the first page of InfoItems.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--" > getNextPageUrl()< / a >
returns the URL of the second page of InfoItems, if one is available.< / li >
< li > < a href = "https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-" > getPage()< / a >
returns a ListExtractor.InfoItemsPage for a given URL, which was retrieved by the < code > getNextPageUrl()< / code > method of the previous page.< / li >
< / ul >
< p > The first page is handled specially because many websites, such as YouTube, load the first page of
items like a regular web page, but all the others via AJAX requests.< / p >
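< p > The paging flow described above (fetch the initial page, then follow next-page URLs) can be sketched as a simple loop. The types below are hypothetical simplifications written for this example: < code > Page< / code > stands in for an InfoItemsPage and holds plain strings, < code > PagedExtractor< / code > stands in for a ListExtractor, and the < code > maxPages< / code > guard is an assumption added to keep the loop bounded.< / p >

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of walking through a paged list.
final class PagingSketch {

    /** Stand-in for an InfoItemsPage: some items plus the URL of the next page. */
    static final class Page {
        final List<String> items;
        final String nextPageUrl; // null when there is no further page

        Page(List<String> items, String nextPageUrl) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
        }
    }

    /** Stand-in for a ListExtractor, reduced to its paging methods. */
    interface PagedExtractor {
        Page getInitialPage();

        Page getPage(String pageUrl);
    }

    /** Fetches the initial page, then follows next-page URLs up to maxPages pages. */
    static List<String> collectAll(PagedExtractor extractor, int maxPages) {
        List<String> all = new ArrayList<>();
        Page page = extractor.getInitialPage();
        all.addAll(page.items);
        int fetched = 1;
        while (page.nextPageUrl != null && !page.nextPageUrl.isEmpty() && fetched < maxPages) {
            page = extractor.getPage(page.nextPageUrl);
            all.addAll(page.items);
            fetched++;
        }
        return all;
    }

    /** Walks a fake three-page list held in memory. */
    static String demo() {
        final Map<String, Page> laterPages = new HashMap<>();
        laterPages.put("page2", new Page(Arrays.asList("c", "d"), "page3"));
        laterPages.put("page3", new Page(Arrays.asList("e"), null));

        PagedExtractor fake = new PagedExtractor() {
            @Override
            public Page getInitialPage() {
                return new Page(Arrays.asList("a", "b"), "page2");
            }

            @Override
            public Page getPage(String pageUrl) {
                return laterPages.get(pageUrl);
            }
        };
        return String.join(",", collectAll(fake, 10));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

< p > Keeping the first page behind its own method lets an implementation parse a regular web page there and use a different request format for every later page.< / p >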
< / div >
< / div >
< footer >
< div class = "rst-footer-buttons" role = "navigation" aria-label = "footer navigation" >
< a href = "../02_Concept_of_LinkHandler/" class = "btn btn-neutral float-right" title = "Concept of the LinkHandler" > Next < span class = "icon icon-circle-arrow-right" > < / span > < / a >
< a href = "../00_Prepare_everything/" class = "btn btn-neutral" title = "Before You Start" > < span class = "icon icon-circle-arrow-left" > < / span > Previous< / a >
< / div >
< hr / >
< div role = "contentinfo" >
<!-- Copyright etc -->
< / div >
Built with < a href = "http://www.mkdocs.org" > MkDocs< / a > using a < a href = "https://github.com/snide/sphinx_rtd_theme" > theme< / a > provided by < a href = "https://readthedocs.org" > Read the Docs< / a > .
< / footer >
< / div >
< / div >
< / section >
< / div >
< div class = "rst-versions" role = "note" style = "cursor: pointer" >
< span class = "rst-current-version" data-toggle = "rst-current-version" >
< span > < a href = "../00_Prepare_everything/" style = "color: #fcfcfc;" > « Previous< / a > < / span >
< span style = "margin-left: 15px" > < a href = "../02_Concept_of_LinkHandler/" style = "color: #fcfcfc" > Next » < / a > < / span >
< / span >
< / div >
< script > var base_url = '..'; < / script >
< script src = "../js/theme.js" defer > < / script >
< script src = "../search/main.js" defer > < / script >
< / body >
< / html >