Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringPool class (ArrayPool<T>-like to cache strings) #3380

Merged

Conversation

Sergio0694
Copy link
Member

@Sergio0694 Sergio0694 commented Jul 7, 2020

PR Type

What kind of change does this PR introduce?

  • Feature

What is the new behavior?

This PR introduces a new StringPool type, inspired by the ArrayPool<T> type from the BCL, which can be used to cache and reuse string instances. It is a similar concept to string interning, with the additional benefit of working with just a ReadOnlySpan<char> as input (so that the string allocation can be skipped entirely if an equal instance exists already), and offers more customizaton compared to string interning. It also allows users to reset the cache if needed, as well as creating multiple, configurable cache instances for different purposes, if necessary. This is the new API surface:

namespace Microsoft.Toolkit.HighPerformance.Buffers
{
    public sealed class StringPool
    {
        StringPool();
        StringPool(int minimumSize);

        static StringPool Shared { get; }

        int Size { get; }

        void Add(string value);
        string GetOrAdd(string value);
        string GetOrAdd(ReadOnlySpan<char> span);
        string GetOrAdd(ReadOnlySpan<byte> span, Encoding encoding);
        bool TryGet(ReadOnlySpan<char> span, out string? value);
        void Reset();
    }
}

Example usage

This sample shows how to parse a domain string from a given URL, using StringPool:

public static string GetHost(string url)
{
    // We assume the input might start either with eg. https:// (or other prefix),
    // or directly with the host name. Furthermore, we also assume that the input
    // URL will always have a '/' character right after the host name.
    // For instance: "https://docs.microsoft.com/dotnet/api/system.string.intern".
    int
        prefixOffset = url.AsSpan().IndexOf(stackalloc char[] { ':', '/', '/' }),
        startIndex = prefixOffset == -1 ? 0 : prefixOffset + 3,
        endIndex = url.AsSpan(startIndex).IndexOf('/');
    // In this example, it would be "docs.microsoft.com"
    ReadOnlySpan<char> span = url.AsSpan(startIndex, endIndex);
    return StringPool.Shared.GetOrAdd(span);
}

Benchmarks

These are some benchmark results from here, when parsing a ~25 MB .csv file in UTF8 format with many duplicates. The difference between StackallocGetOrAdd and EmbeddedGetOrAdd is whether the UTF16 conversion is done externally or using the implicit StringPool API. In both cases, the memory usage goes down considerably, and performance is still pretty fast 🚀

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
Default 377.2 ms 0.68 ms 0.45 ms 1.00 11000.0000 4000.0000 - 67108800 B
StackallocGetOrAdd 334.6 ms 6.30 ms 4.17 ms 0.89 - - - 224 B
EmbeddedGetOrAdd 416.0 ms 4.51 ms 2.98 ms 1.10 - - - 272 B

PR Checklist

Please check if your PR fulfills the following requirements:

  • Tested code with current supported SDKs
  • Pull Request has been submitted to the documentation repository instructions. Link: #366
  • Sample in sample app has been added / updated (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features) (if applicable)
  • Header has been added to all new source files (run build/UpdateHeaders.bat)
  • Contains NO breaking changes

@ghost
Copy link

ghost commented Jul 7, 2020

Thanks Sergio0694 for opening a Pull Request! The reviewers will test the PR and highlight if there is any conflict or changes required. If the PR is approved we will proceed to merge the pull request 🙌

@ghost ghost assigned michael-hawker Jul 7, 2020
@Sergio0694 Sergio0694 added high-performance 🚂 Issues/PRs for the Microsoft.Toolkit.HighPerformance package optimization ☄ Performance or memory usage improvements labels Jul 7, 2020
@ghost
Copy link

ghost commented Jul 7, 2020

This PR has been marked as "needs attention 👋" and awaiting a response from the team.

@Sergio0694 Sergio0694 removed the optimization ☄ Performance or memory usage improvements label Jul 8, 2020
@Sergio0694
Copy link
Member Author

Sergio0694 commented Sep 14, 2020

This PR should be at 100% code coverage now, I've added some more tests in the last commit 😊

image

NOTE: that 1% left is just a line with a closing bracket being incorrectly counted for some reason.

@Sergio0694
Copy link
Member Author

@sonnemaf have you tested this out from the CI build as well yet, would love any feedback you have since you indicated this would be useful to you.

Hey @sonnemaf - just a small ping in case you missed the previous message from @michael-hawker here. Did you have a chance to test the final (current) implementation of StringPool from the CI? Any other feedbacks on this PR? Thanks! 😊

@michael-hawker
Copy link
Member

If @azchohfi or @sonnemaf can sign-off on this, I'm good. 🙂

@sonnemaf
Copy link
Contributor

I have tested it with my small test projects and found no issues. I think it is good to go. Signing it off.

Can't wait to use it in my production code.

@Sergio0694
Copy link
Member Author

Hey @sonnemaf that's great to hear, awesome! 😄

Looks like the CI failed due to the Azure outage from yesterday (perfect timing!), but I don't have the option to re-trigger the pipeline myself, @azchohfi would you mind just bumping that so we can check that everything looks good here? Thanks! 😊

@azchohfi
Copy link
Contributor

Done.

@Sergio0694
Copy link
Member Author

anakin

@azchohfi
Copy link
Contributor

Absolutely, young padawan!
image

@Sergio0694
Copy link
Member Author

Oof was looking in the wrong place, yeah I see that now! Thank you! 😊

@Sergio0694
Copy link
Member Author

@michael-hawker Small ping in case you missed the notification, Fons approved this PR in his comment above (here), let me know if this is alright to merge now or if you'd like to have another look at it first. Excited to have this be part of the Preview 3 😄🚀

@ghost
Copy link

ghost commented Oct 6, 2020

Hello @michael-hawker!

Because this pull request has the auto merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@michael-hawker michael-hawker merged commit b20befb into CommunityToolkit:master Oct 6, 2020
@Sergio0694
Copy link
Member Author

Awesome! 🎉🎉🎉

Thank you @sonnemaf for the review and feedbacks, and @michael-hawker for the approval and merge! 😄

@sonnemaf
Copy link
Contributor

sonnemaf commented Oct 8, 2020

I just saw that the StringPool is not mentioned in the description of the preview 3 NuGet package.

image

@Sergio0694
Copy link
Member Author

@sonnemaf Nice catch! Fixed that in aef54c6 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto merge ⚡ feature 💡 high-performance 🚂 Issues/PRs for the Microsoft.Toolkit.HighPerformance package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants