Deployed da41b6d
with MkDocs version: 1.0.4
|
@ -153,10 +153,9 @@
|
|||
<div class="section">
|
||||
|
||||
<h1 id="before-you-start">Before You Start</h1>
|
||||
<p>These documents will guide you through the process of creating your own Extractor
|
||||
service, which will enable NewPipe to access additional streaming services, such as the currently supported YouTube and SoundCloud.
|
||||
The whole documentation consists of this page, which explains the general concept of the NewPipeExtractor, as well as our
|
||||
<a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/">Jdoc</a> setup.</p>
|
||||
<p>These documents will guide you through the process of understanding or creating your own Extractor
|
||||
service, which will enable NewPipe to access additional streaming services, such as the currently supported YouTube, SoundCloud and MediaCCC.
|
||||
The whole documentation consists of this page and our <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/">Jdoc</a> setup, which together explain the general concept of the NewPipeExtractor.</p>
|
||||
<p><strong>IMPORTANT!!!</strong> This is likely to be the worst documentation you have ever read, so do not hesitate to
|
||||
<a href="https://github.com/teamnewpipe/documentation/issues">report</a> if
|
||||
you find any spelling errors, incomplete parts or you simply don't understand something. We are an open community
|
||||
|
|
|
@ -75,7 +75,7 @@
|
|||
|
||||
<li><a class="toctree-l3" href="#collectorextractor-pattern-for-lists">Collector/Extractor Pattern for Lists</a></li>
|
||||
|
||||
<li><a class="toctree-l3" href="#infoitems-encapsulated-in-pages">InfoItems Encapsulated in Pages</a></li>
|
||||
<li><a class="toctree-l3" href="#listextractor">ListExtractor</a></li>
|
||||
|
||||
</ul>
|
||||
|
||||
|
@ -196,15 +196,16 @@ try {
|
|||
<h2 id="collectorextractor-pattern-for-lists">Collector/Extractor Pattern for Lists</h2>
|
||||
<p>Information can be represented as a list. In NewPipe, a list is represented by a
|
||||
<a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html">InfoItemsCollector</a>.
|
||||
A InfoItemCollector will collect and assemble a list of <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html">InfoItem</a>.
|
||||
For each item that should be extracted, a new Extractor must be created, and given to the InfoItemCollector via <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-">commit()</a>.</p>
|
||||
An InfoItemsCollector collects and assembles a list of <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html">InfoItem</a>s.
|
||||
For each item that should be extracted, a new Extractor must be created, and given to the InfoItemsCollector via <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-">commit()</a>.</p>
|
||||
<p><img alt="InfoItemsCollector_objectdiagram.svg" src="../img/InfoItemsCollector_objectdiagram.svg" /></p>
|
||||
<p>If you are implementing a list for your service you need to extend InfoItem containing the extracted information
|
||||
and implement an <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html">InfoItemExtractor</a>,
|
||||
that will return the data of one InfoItem.</p>
|
||||
<p>If you are implementing a list in your service you need to implement an <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html">InfoItemExtractor</a>,
|
||||
that will be able to retrieve data for one and only one InfoItem. This extractor will then be <em>committed</em> to the <strong>InfoItemsCollector</strong> that can collect the type of InfoItems you want to generate.</p>
|
||||
<p>A common implementation would look like this:</p>
|
||||
<pre><code>private MyInfoItemCollector collectInfoItemsFromElement(Element e) {
|
||||
MyInfoItemCollector collector = new MyInfoItemCollector(getServiceId());
|
||||
<pre><code>private SomeInfoItemCollector collectInfoItemsFromElement(Element e) {
|
||||
// See *Some* as something like Stream or Channel
|
||||
// e.g. StreamInfoItemsCollector, and ChannelInfoItemsCollector are provided by NP
|
||||
SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
|
||||
|
||||
for(final Element li : e.children()) {
|
||||
collector.commit(new InfoItemExtractor() {
|
||||
|
@ -225,15 +226,21 @@ that will return the data of one InfoItem.</p>
|
|||
|
||||
</code></pre>
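<p>The commit pattern described above can be sketched without the NewPipeExtractor library. Everything in this block (<code>SimpleInfoItem</code>, <code>SimpleItemExtractor</code>, <code>SimpleCollector</code>, the row format) is a hypothetical stand-in invented for illustration, not the real API; in particular, recording a failed item as an error instead of aborting is an assumption of this sketch.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CollectorSketch {
    /** Hypothetical stand-in for an InfoItem: plain extracted data. */
    static final class SimpleInfoItem {
        final String name;
        final String url;

        SimpleInfoItem(String name, String url) {
            this.name = name;
            this.url = url;
        }
    }

    /** Hypothetical stand-in for an InfoItemExtractor: yields the data of exactly one item. */
    interface SimpleItemExtractor {
        String getName() throws Exception;
        String getUrl() throws Exception;
    }

    /** Hypothetical stand-in for an InfoItemsCollector. */
    static final class SimpleCollector {
        private final List<SimpleInfoItem> items = new ArrayList<>();
        private final List<Throwable> errors = new ArrayList<>();

        /** Runs one extractor; a failing item is recorded as an error (assumption of this sketch). */
        void commit(SimpleItemExtractor extractor) {
            try {
                items.add(new SimpleInfoItem(extractor.getName(), extractor.getUrl()));
            } catch (Exception e) {
                errors.add(e);
            }
        }

        List<SimpleInfoItem> getItems() { return items; }
        List<Throwable> getErrors() { return errors; }
    }

    /** Creates one extractor per row ({name, url}) and commits it to the collector. */
    static SimpleCollector collect(List<String[]> rows) {
        SimpleCollector collector = new SimpleCollector();
        for (final String[] row : rows) {
            collector.commit(new SimpleItemExtractor() {
                @Override
                public String getName() {
                    return row[0];
                }

                @Override
                public String getUrl() throws Exception {
                    if (row[1] == null) {
                        throw new Exception("missing url for " + row[0]);
                    }
                    return row[1];
                }
            });
        }
        return collector;
    }

    public static void main(String[] args) {
        SimpleCollector collector = collect(Arrays.asList(
                new String[] {"First video", "https://example.invalid/1"},
                new String[] {"Broken entry", null}));
        System.out.println(collector.getItems().size() + " item(s), "
                + collector.getErrors().size() + " error(s)");
    }
}
```

<p>Each row gets its own extractor object, so one malformed entry does not prevent the remaining items from being collected.</p>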
|
||||
|
||||
<h2 id="infoitems-encapsulated-in-pages">InfoItems Encapsulated in Pages</h2>
|
||||
<h2 id="listextractor">ListExtractor</h2>
|
||||
<p>There is more to know about lists:</p>
|
||||
<ol>
|
||||
<li>
|
||||
<p>When a streaming site shows a list of items, it usually offers some additional information about that list like its title, a thumbnail,
|
||||
and its creator. Such info can be called <strong>list header</strong>.</p>
|
||||
<p>When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down. </p>
|
||||
<p>This is why lists in NewPipe are chopped down into smaller lists called <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html">InfoItemsPage</a>s. Each page has its own URL, and needs to be extracted separately.</p>
|
||||
<p>Additional metadata about the list and extracting multiple pages can be handled by a
|
||||
<a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html">ListExtractor</a>,
|
||||
and its <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html">ListExtractor.InfoItemsPage</a>.</p>
|
||||
<p>For extracting list header information it behaves like a regular extractor. For handling <code>InfoItemsPages</code> it adds methods
|
||||
</li>
|
||||
<li>
|
||||
<p>When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down.</p>
|
||||
</li>
|
||||
</ol>
|
||||
<p>Both of these problems are solved by the <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html">ListExtractor</a>, which takes care of extracting additional metadata about the list
and of chopping lists into several pages, so-called <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html">InfoItemsPage</a>s.
|
||||
Each page has its own URL, and needs to be extracted separately.</p>
|
||||
<p>For extracting list header information a <code>ListExtractor</code> behaves like a regular extractor. For handling <code>InfoItemsPages</code> it adds methods
|
||||
such as:</p>
|
||||
<ul>
|
||||
<li><a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--">getInitialPage()</a>
|
||||
|
@ -245,6 +252,46 @@ such as:</p>
|
|||
</ul>
|
||||
<p>The first page is handled specially because many websites, such as YouTube, load the first page of
|
||||
items like a regular web page, but all the others as an AJAX request.</p>
|
||||
<p>An InfoItemsPage itself has two constructors which take these parameters:</p>
<ul>
<li>The <strong>InfoItemsCollector</strong> of the list that the page should represent.</li>
<li>A <strong>nextPageUrl</strong> which represents the URL of the following page (may be null if no page follows).</li>
<li>Optionally <strong>errors</strong>, which is a list of Exceptions that may have happened during extraction.</li>
</ul>
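<p>To make the shape of these parameters concrete, here is a minimal sketch of a page type with the same two-constructor layout. <code>SimplePage</code> and its methods are hypothetical names for illustration only; for brevity the sketch takes a plain list of items where the real class takes a collector.</p>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PageSketch {
    /** Hypothetical stand-in for an InfoItemsPage. */
    static final class SimplePage<T> {
        private final List<T> items;
        private final String nextPageUrl;
        private final List<Throwable> errors;

        /** Short form: no errors were recorded during extraction. */
        SimplePage(List<T> items, String nextPageUrl) {
            this(items, nextPageUrl, Collections.<Throwable>emptyList());
        }

        /** Full form: also carries the errors that occurred during extraction. */
        SimplePage(List<T> items, String nextPageUrl, List<Throwable> errors) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
            this.errors = errors;
        }

        /** A null (or empty) next page URL means this is the last page. */
        boolean hasNextPage() {
            return nextPageUrl != null && !nextPageUrl.isEmpty();
        }

        List<T> getItems() { return items; }
        String getNextPageUrl() { return nextPageUrl; }
        List<Throwable> getErrors() { return errors; }
    }

    public static void main(String[] args) {
        SimplePage<String> first =
                new SimplePage<>(Arrays.asList("a", "b"), "https://example.invalid/list?page=2");
        SimplePage<String> last = new SimplePage<>(
                Arrays.asList("c"), null,
                Arrays.<Throwable>asList(new RuntimeException("one item could not be parsed")));

        System.out.println(first.hasNextPage());
        System.out.println(last.hasNextPage());
        System.out.println(last.getErrors().size());
    }
}
```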
|
||||
<p>Here is a simplified reference implementation of a list extractor that only extracts pages, but not metadata:</p>
|
||||
<pre><code>class MyListExtractor extends ListExtractor {
|
||||
...
|
||||
private Document document;
|
||||
|
||||
...
|
||||
|
||||
public InfoItemsPage<SomeInfoItem> getPage(String pageUrl)
|
||||
throws ExtractionException {
|
||||
SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
|
||||
document = myFunctionToGetThePageHTMLWhatever(pageUrl);
|
||||
|
||||
//remember this part from the simple list extraction
|
||||
for(final Element li : document.children()) {
|
||||
collector.commit(new InfoItemExtractor() {
|
||||
@Override
|
||||
public String getName() throws ParsingException {
|
||||
...
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getUrl() throws ParsingException {
|
||||
...
|
||||
}
|
||||
...
|
||||
});
}
|
||||
return new InfoItemsPage<SomeInfoItem>(collector, myFunctionToGetTheNextPageUrl(document));
|
||||
}
|
||||
|
||||
public InfoItemsPage<SomeInfoItem> getInitialPage() throws ExtractionException {
|
||||
// document has already been initialized by the fetch() function.
|
||||
return getPage(getTheCurrentPageUrl(document));
|
||||
}
|
||||
...
|
||||
}
|
||||
</code></pre>
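<p>From the caller's side, paging usually amounts to fetching the initial page and then following each next page URL until none is left. The loop below is a sketch of that idea only: <code>Page</code>, <code>PagedSource</code>, <code>collectAll</code> and the in-memory "site" are hypothetical stand-ins, not part of NewPipeExtractor.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PagingSketch {
    /** Hypothetical stand-in for an InfoItemsPage holding plain strings. */
    static final class Page {
        final List<String> items;
        final String nextPageUrl; // null when no page follows

        Page(List<String> items, String nextPageUrl) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
        }
    }

    /** Hypothetical stand-in for the paging half of a list extractor. */
    interface PagedSource {
        Page getInitialPage();
        Page getPage(String pageUrl);
    }

    /** Fetches the initial page, then follows nextPageUrl until it is null. */
    static List<String> collectAll(PagedSource source) {
        List<String> all = new ArrayList<>();
        Page page = source.getInitialPage();
        all.addAll(page.items);
        while (page.nextPageUrl != null) {
            page = source.getPage(page.nextPageUrl);
            all.addAll(page.items);
        }
        return all;
    }

    /** A fake two-page "website" kept in memory so the sketch runs offline. */
    static PagedSource demoSource() {
        final Map<String, Page> site = new HashMap<>();
        site.put("list?page=1", new Page(Arrays.asList("a", "b"), "list?page=2"));
        site.put("list?page=2", new Page(Arrays.asList("c"), null));
        return new PagedSource() {
            @Override
            public Page getInitialPage() {
                return getPage("list?page=1");
            }

            @Override
            public Page getPage(String pageUrl) {
                return site.get(pageUrl);
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(collectAll(demoSource())); // prints [a, b, c]
    }
}
```

<p>A real client would typically load further pages on demand (for example when the user scrolls) rather than eagerly, but the termination condition is the same.</p>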
|
||||
|
||||
</div>
|
||||
</div>
|
||||
|
|
|
@ -198,5 +198,5 @@ It focuses on making it possible for the creator of a scraper for a streaming se
|
|||
|
||||
<!--
|
||||
MkDocs version : 1.0.4
|
||||
Build Date UTC : 2019-04-07 17:32:18
|
||||
Build Date UTC : 2019-07-02 12:24:36
|
||||
-->
|
||||
|
|
|
@ -2,47 +2,47 @@
|
|||
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
<url>
|
||||
<loc>None</loc>
|
||||
<lastmod>2019-04-07</lastmod>
|
||||
<lastmod>2019-07-02</lastmod>
|
||||
<changefreq>daily</changefreq>
|
||||
</url>
|
||||
</urlset>
|